Introduction to the Crawler Python Toolkit (2)

This article is from the NetEase Cloud community

Author: Wang Tao

The optional parameters are introduced one by one:

Parameter	Description	Example
params	Query-string parameters appended to the URL in the form url?key=value. Accepts a dictionary or a string.	Example 1: `>>> payload = {'key1': 'value1', 'key2': 'value2'} >>> r = requests.get("http://httpbin.org/get", params=payload) >>> print(r.url) http://httpbin.org/get?key2=value2&key1=value1` Example 2: `>>> param = 'httpparams' >>> r = requests.get("http://httpbin.org/get", params=param) >>> print(r.url) http://httpbin.org/get?httpparams`
data	The request body. Accepts a dictionary, a list of tuples, or a string. With POST, a dictionary or list of tuples is form-encoded, which simulates submitting an HTML form.	Example 1: `>>> payload = {'key1': 'value1', 'key2': 'value2'} >>> r = requests.post("http://httpbin.org/post", data=payload) >>> print(r.text) { ... "form": { "key2": "value2", "key1": "value1" }, ... }` Example 2: `>>> payload = (('key1', 'value1'), ('key1', 'value2')) >>> r = requests.post('http://httpbin.org/post', data=payload) >>> print(r.text) { ... "form": { "key1": [ "value1", "value2" ] }, ... }`
json	Sends JSON data to the server with POST; many Ajax requests pass JSON this way.	Example 1: `r = requests.post(url, json={"key": "value"})` The captured request header is Content-Type: application/json.
headers	Custom HTTP headers, merged with the default headers that requests generates to form the final request headers. Note: every header value must be a string, bytestring, or unicode.	Example 1: `r = requests.post(url, headers={"user-agent": "test"})`
cookies	Cookies set by the server can be read from the response's cookies attribute, and custom cookies can be sent to the server. A custom cookie jar also lets you set attributes such as the domain and path for which a cookie is valid.	Example 1: Read a cookie from the response `>>> url = 'http://example.com/some/cookie/setting/url' >>> r = requests.get(url) >>> r.cookies['example_cookie_name']` Example 2: Send cookies to the server `>>> jar = requests.cookies.RequestsCookieJar() >>> jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies') >>> url = ' ' >>> r = requests.get(url, cookies=jar) >>> r.text '{"cookies": {"tasty_cookie": "yum"}}'`
files	Uploads a multipart-encoded file.	Example 1: Upload a file `>>> url = 'http://httpbin.org/post' >>> files = {'file': open('report.xls', 'rb')} >>> r = requests.post(url, files=files)` Example 2: Explicitly set the file name, content type, and extra headers `>>> url = 'http://httpbin.org/post' >>> files = {'file': ('report.xls', open('report.xls', 'rb'), 'application/vnd.ms-excel', {'Expires': '0'})} >>> r = requests.post(url, files=files)`
auth	Supports several authentication types, including HTTPBasicAuth, HTTPDigestAuth, and HTTPProxyAuth.	Example 1: `>>> from requests.auth import HTTPDigestAuth >>> url = 'http://httpbin.org/digest-auth/auth/user/pass' >>> requests.get(url, auth=HTTPDigestAuth('user', 'pass'))` After packet capture, the HTTP header is as follows: `GET http://httpbin.org/digest-auth/auth/user/pass HTTP/1.1 Host: httpbin.org Connection: keep-alive Accept-Encoding: gzip, deflate Accept: */* User-Agent: python-requests/2.12.4 Authorization: Basic dXNlcjpwYXNz` Note that a Basic Authorization header of this form is what HTTPBasicAuth sends; HTTPDigestAuth answers the server's 401 challenge with a Digest header.
timeout	Either a float or a tuple. (1) float: the number of seconds to wait for the server. It limits waiting on the underlying socket, not the total time of the request. (2) tuple: (connect timeout, read timeout).	Example 1: `>>> requests.get('http://github.com', timeout=0.001) Traceback (most recent call last): File "<stdin>", line 1, in <module> requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)`
allow_redirects	Controls redirect handling for GET, OPTIONS, POST, PUT, PATCH, and DELETE requests. The default is True (follow redirects). Set it to False to disable redirect handling, which some crawling scenarios require.	Example 1: With redirects disabled, a 3xx response is returned as-is instead of being followed: `>>> r = requests.get('http://github.com', allow_redirects=False) >>> r.status_code 301 >>> r.history []`
proxies	Proxy configuration. Set the HTTP and HTTPS proxies as required.	Example 1: `>>> proxies = { "http": "http://127.0.0.1:8888", "https": "https://127.0.0.1:8888", } >>> r = requests.get(url, headers=headers, proxies=proxies)`
stream	Set stream to True when the response body should be read as a stream. Note: the response must be closed explicitly, otherwise the connection is not released.	Example 1: Download an image from Baidu:
    import requests
    from contextlib import closing
    def download_image_improve():
        url = (' '  # the start of the URL is blank in the source and is kept as-is
               'image&quality=80&size=b9999_10000'
               '&sec=1504068152047&di=8b53bf6b8e5deb64c8ac726e260091aa&imgtype=0'
               '&src=http%3A%2F%2Fpic.baike.soso.com%2Fp%2F'
               '20140415%2Fbki-20140415104220-671149140.jpg')
        with closing(requests.get(url, stream=True, verify=False)) as response:
            # Open a new, empty PNG file; 'wb' means write in binary mode
            with open('selenium1.png', 'wb') as file:
                # Read the body in chunks; the 128-byte chunk size is an assumed value
                for data in response.iter_content(128):
                    # Write each chunk; when the loop ends, selenium1.png is complete
                    file.write(data)
verify	Defaults to True, which verifies the server certificate. If a string is given, it is the path to a CA bundle (CA_BUNDLE), that is, the certificate path. If you cannot find a certificate, use the PEM file exported from Fiddler as in Example 2, or install the certifi package, which ships a set of trusted root certificates for Requests.	Example 1: Disable certificate validation (recommended) `r = requests.get(url, verify=False)` Example 2: Use Fiddler's converted PEM certificate to access Amazon. FiddlerRoot.zip download address: nos.netease.com/knowledge/2b99aacb-e9bf-42f7-8edf-0f8ca0326533?download=FiddlerRoot.zip Note: the certificate can be in CER or PEM format. You are advised to install the CER certificate first. You can also install Fiddler yourself, trust the Fiddler certificate, export it in CER format, and then convert it to PEM, for example with `openssl x509 -inform der -in FiddlerRoot.cer -out FiddlerRoot.pem`.
    headers = {
        "Host": "www.amazon.com",
        "Connection": "keep-alive",
        "Cache-Control": "max-age=0",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/68.0.3440.106 Safari/537.36"),
        "Accept": ("text/html,application/xhtml+xml,"
                   "application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"),
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8"
    }
    print(requests.get('https://www.amazon.com/', verify=r"FiddlerRoot.pem", headers=headers).content)
cert	Either a string or a tuple. String: the path to the SSL client certificate file (.pem). Tuple: ('cert', 'key'). verify specifies the server certificate, while cert specifies the client certificate (used for HTTPS mutual authentication).	I have not tested or used this parameter. If you are interested, you can look into it.
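
Several of the parameters above only change how the request is built, so their effect can be inspected without sending anything. The following is a minimal sketch, assuming a reasonably recent requests release on Python 3.7 or later; it uses `requests.Request(...).prepare()`, and the variable names are chosen for illustration only.

```python
import json

import requests

# Build the requests without sending them, so each parameter's effect
# on the URL, headers, and body can be examined offline.
get_req = requests.Request(
    "GET",
    "http://httpbin.org/get",
    params={"key1": "value1", "key2": "value2"},
    headers={"user-agent": "test"},
).prepare()

form_req = requests.Request(
    "POST", "http://httpbin.org/post", data={"key1": "value1"}
).prepare()

json_req = requests.Request(
    "POST", "http://httpbin.org/post", json={"key": "value"}
).prepare()

# params ends up in the query string; header lookup is case-insensitive
print(get_req.url)
print(get_req.headers["User-Agent"])

# data is form-encoded, json is serialized and labeled as application/json
print(form_req.headers["Content-Type"], form_req.body)
print(json_req.headers["Content-Type"], json.loads(json_req.body))
```

Parameters such as timeout, allow_redirects, proxies, stream, verify, and cert take effect only when the request is actually sent, so they do not appear on the prepared request.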

Key functions and parameters in Tornado

1 Tornado non-blocking HTTPClient

Tornado has two non-blocking HTTP client implementations: SimpleAsyncHTTPClient and CurlAsyncHTTPClient. Both are used through their base class, AsyncHTTPClient: either choose the implementation with the AsyncHTTPClient.configure method, or instantiate one of the two subclasses directly. The default is SimpleAsyncHTTPClient, which already meets the needs of most users, but we chose CurlAsyncHTTPClient because it has more advantages:

CurlAsyncHTTPClient supports more features, such as proxy settings and specifying the outgoing network interface.
CurlAsyncHTTPClient can also access sites that are not fully compliant with HTTP.
CurlAsyncHTTPClient is faster.
Before Tornado 2.0, CurlAsyncHTTPClient was the default.
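
The choice between the two implementations can be sketched as follows. This is only an illustration, assuming Tornado is installed: the helper name `configure_http_client` is hypothetical, and the fallback to the default client when pycurl is missing is a design choice of this sketch, not something the article prescribes.

```python
from tornado.httpclient import AsyncHTTPClient


def configure_http_client():
    """Prefer the curl-based client; fall back to the default without pycurl."""
    try:
        # Passing the class name as a string imports tornado.curl_httpclient,
        # which in turn requires the pycurl package.
        AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")
        return "curl"
    except ImportError:
        # None restores the default implementation, SimpleAsyncHTTPClient.
        AsyncHTTPClient.configure(None)
        return "simple"


backend = configure_http_client()
print("HTTP client backend:", backend)
```

After this call, every `AsyncHTTPClient()` created in the process uses the selected implementation, so the rest of the crawler does not need to name a concrete class.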

2 Introduction to Tornado key functions and parameters

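Sample code (similar to the previous example):

import traceback
from tornado import gen
from tornado.curl_httpclient import CurlAsyncHTTPClient
from tornado.httpclient import HTTPRequest
from tornado.ioloop import IOLoop

@gen.coroutine
def fetch_url(url):
    """Fetch a URL."""
    try:
        c = CurlAsyncHTTPClient()  # create an HTTP client
        req = HTTPRequest(url=url)  # create a request
        response = yield c.fetch(req)  # send the request
        print(response.body)
    except Exception:
        print(traceback.format_exc())
    finally:
        IOLoop.current().stop()  # stop the IOLoop even if the fetch failed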

As you can see, this HTTP client is also very easy to use: create an HTTPRequest, then call the client's fetch method to send the request. Let's take a look at the definition of HTTPRequest and see which key parameters we need to know.
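
On Python 3.7 or later with Tornado 5 or later, the same fetch can also be written as a native coroutine and driven by asyncio. The sketch below is an alternative under those assumptions, not the article's own code; the timeout values are illustrative.

```python
import asyncio
import traceback

from tornado.httpclient import AsyncHTTPClient, HTTPRequest


async def fetch_url(url):
    """Fetch url and return the response body, or None if the request fails."""
    client = AsyncHTTPClient()  # uses whichever implementation is configured
    request = HTTPRequest(url=url, connect_timeout=5.0, request_timeout=10.0)
    try:
        response = await client.fetch(request)
    except Exception:
        # fetch() raises on connection errors and, by default, on non-200 responses
        traceback.print_exc()
        return None
    return response.body


# Usage (performs a real network request, so it is left commented out):
# body = asyncio.run(fetch_url("http://httpbin.org/get"))
```

Returning the body instead of printing it keeps the coroutine reusable, and asyncio.run removes the need to stop the IOLoop by hand.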

class HTTPRequest(object):
    """HTTP client request object."""

    # Default values for HTTPRequest parameters.
    # Merged with the values on the request object by AsyncHTTPClient
    # implementations.
    _DEFAULTS = dict(
        connect_timeout=20.0,
        request_timeout=20.0,
        follow_redirects=True,
        max_redirects=5,
        decompress_response=True,
        proxy_password='',
        allow_nonstandard_methods=False,
        validate_cert=True)

    def __init__(self, url, method="GET", headers=None, body=None,
                 auth_username=None, auth_password=None, auth_mode=None,
                 connect_timeout=None, request_timeout=None,
                 if_modified_since=None, follow_redirects=None,
                 max_redirects=None, user_agent=None, use_gzip=None,
                 network_interface=None, streaming_callback=None,
                 header_callback=None, prepare_curl_callback=None,
                 proxy_host=None, proxy_port=None, proxy_username=None,
                 proxy_password=None, proxy_auth_mode=None,
                 allow_nonstandard_methods=None, validate_cert=None,
                 ca_certs=None, allow_ipv6=None, client_key=None,
                 client_cert=None, body_producer=None,
                 expect_100_continue=False, decompress_response=None,
                 ssl_options=None):
        r"""All parameters except ``url`` are optional.
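
To make the constructor signature more concrete, here is a small sketch that builds an HTTPRequest and reads the values back, without sending anything. It assumes Tornado is installed; the URL, header, timeout, and proxy values are hypothetical and chosen only for illustration.

```python
from tornado.httpclient import HTTPRequest

# Illustrative values only; none of them come from the article.
req = HTTPRequest(
    url="http://httpbin.org/get",
    method="GET",
    headers={"User-Agent": "test"},
    connect_timeout=5.0,      # seconds allowed for establishing the connection
    request_timeout=10.0,     # seconds allowed for the whole request
    follow_redirects=False,   # return 3xx responses instead of following them
    proxy_host="127.0.0.1",   # proxy options are honored only by the curl client
    proxy_port=8888,
    validate_cert=True,       # verify the server's TLS certificate
)

print(req.url, req.method)
print(req.headers["User-Agent"])
print(req.connect_timeout, req.request_timeout, req.follow_redirects)

# Parameters that were not passed stay None on the request object; the client
# merges in the values from _DEFAULTS when the request is fetched.
print(req.max_redirects)
```

This is why most parameters default to None in the signature: None means "not set by the caller", which lets the client distinguish an explicit value from a default.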
