This article is from the NetEase Cloud community.
Author: Wang Tao
Here are some examples of common code, covering GET requests, POST requests (JSON and form data), and access with certificates.

GET request
import traceback

from tornado import gen
from tornado.ioloop import IOLoop
from tornado.httpclient import HTTPRequest
from tornado.curl_httpclient import CurlAsyncHTTPClient

@gen.coroutine
def fetch_url():
    try:
        c = CurlAsyncHTTPClient()  # define an HTTP client
        myheaders = {
            "Host": "weixin.sogou.com",
            "Connection": "keep-alive",
            "Cache-Control": "max-age=0",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8"
        }
        url = "http://weixin.sogou.com/weixin?type=1&s_from=input&query=%E4%BA%BA%E6%B0%91%E6%97%A5%E6%8A%A5&ie=utf8&_sug_=n&_sug_type_="
        req = HTTPRequest(url=url, method="GET", headers=myheaders,
                          follow_redirects=True, request_timeout=20, connect_timeout=10,
                          proxy_host="127.0.0.1",  # route through the local Fiddler proxy
                          proxy_port=8888)
        response = yield c.fetch(req)  # make the request
        print(response.code)
        print(response.body)
        IOLoop.current().stop()  # stop the IOLoop
    except Exception:
        print(traceback.format_exc())
The request headers can be verified in Fiddler's capture.
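The snippets in this article stop the IOLoop themselves but never show how to start it. A minimal driver, as a sketch reusing the imports above, could look like this:

if __name__ == "__main__":
    IOLoop.current().spawn_callback(fetch_url)  # schedule the coroutine
    IOLoop.current().start()  # runs until fetch_url() calls stop()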
POST JSON data request
import json  # in addition to the imports from the GET example

@gen.coroutine
def fetch_url():
    """Fetch a url."""
    try:
        c = CurlAsyncHTTPClient()  # define an HTTP client
        myheaders = {
            "Host": "weixin.sogou.com",
            "Connection": "keep-alive",
            "Cache-Control": "max-age=0",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Content-Type": "application/json",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8"
        }
        url = "http://127.0.0.1?type=1&s_from=input&query=%E4%BA%BA%E6%B0%91%E6%97%A5%E6%8A%A5&ie=utf8&_sug_=n&_sug_type_="
        body = json.dumps({"key1": "value1", "key2": "value2"})  # JSON-formatted data
        req = HTTPRequest(url=url, method="POST", headers=myheaders,
                          follow_redirects=True, request_timeout=20, connect_timeout=10,
                          proxy_host="127.0.0.1", proxy_port=8888, body=body)
        response = yield c.fetch(req)  # make the request
        print(response.code)
        print(response.body)
        IOLoop.current().stop()  # stop the IOLoop
    except Exception:
        print(traceback.format_exc())
The request headers can be verified in Fiddler's capture.
POST Form data request
@gen.coroutine
def fetch_url():
    """Fetch a url."""
    try:
        c = CurlAsyncHTTPClient()  # define an HTTP client
        myheaders = {
            "Host": "weixin.sogou.com",
            "Connection": "keep-alive",
            "Cache-Control": "max-age=0",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            # "Content-Type": "application/json",  # not needed for a form post
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8"
        }
        import urllib
        url = "http://127.0.0.1?type=1&s_from=input&query=%E4%BA%BA%E6%B0%91%E6%97%A5%E6%8A%A5&ie=utf8&_sug_=n&_sug_type_="
        body = urllib.urlencode({"key1": "value1", "key2": "value2"})  # encode the form data (Python 2)
        req = HTTPRequest(url=url, method="POST", headers=myheaders,
                          follow_redirects=True, request_timeout=20, connect_timeout=10,
                          proxy_host="127.0.0.1", proxy_port=8888, body=body)
        response = yield c.fetch(req)  # make the request
        print(response.code)
        print(response.body)
        IOLoop.current().stop()  # stop the IOLoop
    except Exception:
        print(traceback.format_exc())
The request headers can be verified in Fiddler's capture.
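Note that urllib.urlencode is Python 2 only; on Python 3 the same encoding comes from urllib.parse. A sketch under that assumption:

from urllib.parse import urlencode

body = urlencode({"key1": "value1", "key2": "value2"})  # -> "key1=value1&key2=value2"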
Accessing HTTPS with a certificate
@gen.coroutine
def fetch_url():
    """Fetch a url."""
    try:
        c = CurlAsyncHTTPClient()  # define an HTTP client
        myheaders = {
            "Host": "www.amazon.com",
            "Connection": "keep-alive",
            "Cache-Control": "max-age=0",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                           "AppleWebKit/537.36 (KHTML, like Gecko) "
                           "Chrome/68.0.3440.106 Safari/537.36"),
            "Accept": ("text/html,application/xhtml+xml,"
                       "application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"),
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8"
        }
        url = "https://www.amazon.com/"
        req = HTTPRequest(url=url, method="GET", headers=myheaders,
                          follow_redirects=True, request_timeout=20, connect_timeout=10,
                          proxy_host="127.0.0.1", proxy_port=8888,
                          ca_certs="FiddlerRoot.pem")  # bind the CA certificate
        response = yield c.fetch(req)  # make the request
        print(response.code)
        print(response.body)
        IOLoop.current().stop()  # stop the IOLoop
    except Exception:
        print(traceback.format_exc())
Fiddler's capture shows the HTTPS request completing normally.
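If you only need to get past certificate errors during local debugging, rather than trusting Fiddler's root certificate, Tornado's HTTPRequest also takes a validate_cert flag. A minimal sketch (not for production use):

req = HTTPRequest(url=url, method="GET", headers=myheaders,
                  proxy_host="127.0.0.1", proxy_port=8888,
                  validate_cert=False)  # skip certificate verification entirely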
Four, Summary

Use Requests when the fetch volume is small: it is simple and easy to use. When heavy concurrency is needed, Tornado is recommended: it is efficient, and its single-threaded model is easy to program against.
The interfaces and parameters of Requests and Tornado described above can solve most of the problems a crawler faces, including concurrent fetching, common anti-crawling measures, and fetching HTTPS sites.
Here’s a snippet of my own common fetching logic:
import random
import time

import tornado.util
from tornado import gen
from tornado.ioloop import IOLoop, PeriodicCallback
from tornado.queues import Queue
from tornado.httpclient import HTTPRequest
from tornado.curl_httpclient import CurlAsyncHTTPClient

TASK_QUE = Queue(maxsize=1000)

def response_handler(res):
    """Process the response: typically parse out the target data and push newly discovered URLs onto the task queue."""
    pass
@gen.coroutine
def url_fetcher_without_param():
    pass
@gen.coroutine
def url_fetcher(*args, **kwargs):
    global TASK_QUE
    c = CurlAsyncHTTPClient()
    while 1:
        # console_show_log("Let's spider")
        try:
            param = yield TASK_QUE.get(time.time() + 300)  # 5-minute timeout
        except tornado.util.TimeoutError:
            yield gen.sleep(random.randint(10, 100))
            continue
        try:
            # build the request from the queued task; configure method, headers, ... as needed
            req = HTTPRequest(param, method="GET")
            response = yield c.fetch(req)
            if response.code == 200:
                response_handler(response.body)
        except Exception:
            yield gen.sleep(10)
            continue
        finally:
            print("I am a slow spider")
            yield gen.sleep(random.randint(10, 100))

@gen.coroutine
def period_callback():
    pass

def main():
    io_loop = IOLoop.current()
    # add concurrency: spawn several fetcher coroutines
    io_loop.spawn_callback(url_fetcher, 1)
    io_loop.spawn_callback(url_fetcher, 2)
    io_loop.spawn_callback(url_fetcher_without_param)  # the parameter is optional
    # if periodic calls are required, use PeriodicCallback:
    PERIOD_CALLBACK_MILSEC = 10  # callback interval, in ms
    PeriodicCallback(period_callback, PERIOD_CALLBACK_MILSEC).start()
    io_loop.start()

if __name__ == "__main__":
    main()
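One thing the skeleton leaves implicit is how URLs get into TASK_QUE in the first place. A minimal seeding step, with a hypothetical seed list, could run at the top of main():

SEED_URLS = ["http://example.com/"]  # hypothetical seed list

def seed_tasks():
    for u in SEED_URLS:
        TASK_QUE.put_nowait(u)  # pre-fill the queue before the fetcher coroutines start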
That concludes the examples; discussion and exchange are welcome.
Five, References

- Requests quick start: docs.python-requests.org/zh_CN/lates…
- Requests advanced usage: docs.python-requests.org/en/master/u…
- What is a CA_BUNDLE: www.namecheap.com/support/kno…
- How to download images with Requests: stackoverflow.com/questions/1…
- Tornado AsyncHTTPClient: www.tornadoweb.org/en/stable/h…
- The 100 Continue status code: developer.mozilla.org/zh-CN/docs/…
- HTTP authentication: developer.mozilla.org/en-US/docs/…
- Certificates: www.alibabacloud.com/help/zh/faq…