Python3 development environment configuration

Request library installation

To implement HTTP request operations, several Python libraries are needed, such as Requests, Selenium, Aiohttp, etc.

  1. Selenium is an automated testing tool that drives the browser to perform specific actions.
  2. ChromeDriver drives Chrome to perform these operations. Run the chromedriver executable from its download directory to verify that it works.

If sudo mv chromedriver /usr/bin fails because of macOS System Integrity Protection (SIP), the csrutil tool must be executed from the Recovery OS. In that case, power off the device, power it on while holding Command + R to enter recovery mode, and run csrutil disable in the terminal to disable SIP.

from selenium import webdriver
browser = webdriver.Chrome()

After running, a blank Chrome browser window pops up.

  3. Aiohttp is a library that provides asynchronous web services; its asynchronous operations use the async/await keywords (a minimal sketch follows this list).
  4. BeautifulSoup, whose installation relies on the lxml library for its HTML and XML parsers.
  5. PyQuery is also a powerful web page parsing tool; it provides a jQuery-like syntax for parsing HTML documents and supports CSS selectors.
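
As a quick illustration of the async/await usage mentioned for Aiohttp above, here is a minimal sketch; it assumes aiohttp is installed and uses httpbin.org only as a throwaway test endpoint:

import asyncio
import aiohttp

async def fetch(url):
    # an aiohttp session and request are both asynchronous context managers
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

print(asyncio.run(fetch('http://httpbin.org/get')))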

Installing the databases

  1. MySQL: a lightweight relational database that stores data in the form of tables.
  2. MongoDB: a non-relational database written in C++.
  3. Redis: an efficient in-memory non-relational database.

brew install redis

Start the Redis server

brew services start redis
redis-server /usr/local/etc/redis.conf
# Stop and restart
brew services stop redis
brew services restart redis

4. PyMySQL: under Python3, PyMySQL is required to store data in MySQL.

pip3 install pymysql

Verify the installation

>>> import pymysql
>>> pymysql.VERSION
(0, 9, 2, None)
>>>

Crawler frameworks

##### 1. PySpider, a powerful web crawler framework

pip3 install pyspider

Under Python 3.7, installation produces the following error:

Traceback (most recent call last):
  File "/usr/local/bin/pyspider", line 6, in <module>
    from pyspider.run import main
  File "/ usr/local/lib/python3.7 / site - packages/pyspider/run. The p y", line 231
    async=True, get_object=False, no_input=False):
        ^
SyntaxError: invalid syntax

The reason is that async has become a reserved keyword since Python 3.7 and can no longer be used as a parameter name.

##### 2. Scrapy

pip3 install Scrapy

##### 3. ScrapySplash, a JavaScript rendering tool for Scrapy. Its installation has two parts: one is the Splash service itself, installed via Docker, whose interface can be used to load JavaScript pages; the other is the scrapy-splash Python library, which lets you use the Splash service inside Scrapy.

Install Splash

docker run -p 8050:8050 scrapinghub/splash

Install the scrapy-splash Python library

pip3 install scrapy-splash

##### 4. ScrapyRedis installation

ScrapyRedis is a distributed extension for Scrapy that makes it possible to build distributed crawlers.

pip3 install scrapy-redis

Verify the installation

$ python3
>>> import scrapy_redis

##### 5. Install related libraries

1. Docker installation

brew cask install docker

2. Scrapyd installation

Scrapyd is a tool for deploying and running Scrapy projects; projects can be uploaded to a cloud host and controlled through an API.

pip3 install scrapyd

3. ScrapydClient installation

pip3 install scrapyd-client

After a successful installation, a deployment command called scrapyd-deploy becomes available.

scrapyd-deploy -h

4. ScrapydAPI installation

pip3 install python-scrapyd-api

With it, the Scrapyd API can be requested directly to obtain the status of Scrapy tasks on the current host.

from scrapyd_api import ScrapydAPI
scrapyd = ScrapydAPI('http://localhost:6800')
print(scrapyd.list_projects())

This makes it possible to get the state of Scrapy tasks on each host directly in Python.

5. Scrapyrt provides an HTTP interface for scheduling Scrapy tasks, so tasks can be started by requesting an HTTP endpoint instead of running Scrapy commands by hand (a short sketch follows the install command below).

pip3 install scrapyrt
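
As a rough sketch of what this looks like from Python: the request below schedules a spider through Scrapyrt's crawl.json endpoint. The port 9080 is Scrapyrt's documented default, and the spider name and start URL are placeholders for your own project:

import requests

# Scrapyrt exposes a crawl.json endpoint; spider_name and url tell it what to run
params = {'spider_name': 'example_spider', 'url': 'http://example.com'}
response = requests.get('http://localhost:9080/crawl.json', params=params)
print(response.json())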

Crawler basics

Using a proxy disguises your IP address and prevents the local IP from being blocked for sending too many requests per unit of time.

Types of proxies

  1. FTP proxy server
  2. HTTP proxy server
  3. SSL/TLS proxy: used to access encrypted websites
  4. RTSP proxy: used by RealPlayer to access Real streaming media servers, usually with a caching function
  5. POP3/SMTP proxy: used to send and receive mail over POP3/SMTP, usually with a caching function; typical ports are 110/25
  6. SOCKS proxy: simply forwards data packets and does not care about the specific protocol or usage
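
For HTTP/HTTPS proxies, here is a hedged sketch of how a proxy is passed to requests; the proxy address 127.0.0.1:9743 is just a placeholder for whatever proxy you actually run:

import requests

proxies = {
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743',
}
# all traffic for this request is routed through the configured proxy
r = requests.get('https://httpbin.org/get', proxies=proxies)
print(r.text)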

Use of basic libraries

urllib: its request module is used to initiate network requests

import ssl
import urllib.request
import urllib.parse

ssl._create_default_https_context = ssl._create_unverified_context
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data= data)
print(response.read().decode('utf-8'))
print(type(response))

The ssl module needs to be imported under Python 3 to skip certificate verification; the address httpbin.org/post can be used for…

urlopen() can only make simple network requests. If you need to set more headers, you need to use the Request class.
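
A minimal sketch of the Request class, again using httpbin.org as a test endpoint; the header values are only examples:

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3)',
    'Host': 'httpbin.org'
}
data = bytes(parse.urlencode({'word': 'hello'}), encoding='utf8')
# Request lets you attach headers, data and a method before calling urlopen
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))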

Page authentication

HTTPBasicAuthHandler can be used to complete HTTP basic authentication.

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from  urllib.error import URLError

username = 'username'
password = 'password'

url = 'http://127.0.0.1:8000/06.%E7%9B%92%E5%AD%90%E6%A8%A1%E5%9E%8B.html'
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

Analyzing the Robots protocol with robotparser

The Robots protocol, also known as the crawler protocol or robot protocol, tells crawlers and search engines which pages may be crawled. It usually takes the form of a robots.txt text file placed in the root directory of the website. A robots.txt example:

User-agent: * 
Disallow: /
Allow: /public/

User-agent specifies the name of the crawler the rules apply to. Disallow specifies directories that must not be crawled; in the example above, crawling is disallowed for all pages. Allow indicates pages that may be crawled, here everything under /public/.

The robotparser module provides the RobotFileParser class, which determines whether a crawler has permission to crawl a web page based on the site's robots.txt file.
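
A short sketch of RobotFileParser; the site and article URL are illustrative only:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.jianshu.com/robots.txt')
rp.read()  # fetch and parse the robots.txt file
# can_fetch() reports whether the given user agent may crawl the given URL
print(rp.can_fetch('*', 'https://www.jianshu.com/p/example'))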

Fetching binary data

Images are binary data and are represented as bytes.

import requests

r = requests.get('https://github.com/favicon.ico')
with open('favicon.ico', 'wb') as f:
    f.write(r.content)

The open() method takes the file name as its first argument and the mode 'wb' as its second; opening the file for binary writing allows the binary data to be written to the file and saved.

Note: to serve local network requests, I use a Python web framework directly to build a local server.

There are many good Web frameworks in Python development, such as Django, Tornado, Flask

Flask is a microframework for small applications with simple needs. Django is an all-in-one web framework that includes an ORM out of the box. Tornado is Facebook's open-source asynchronous web framework.

I built a local server with Tornado and wrote a GET and a POST handler.

from tornado.web import Application, RequestHandler, url
from tornado.ioloop import IOLoop
from tornado.httpserver import HTTPServer
import tornado.options
import json

tornado.options.define('port', default=9000, type=int, help='this is the port for the application')


class IndexHandler(RequestHandler):
    """Minimal index handler (assumed here; the original code referenced it without showing it)"""

    def get(self):
        self.write('index')


class RegistHandler(RequestHandler):
    """docstring for RegisterHandler"""

    def initialize(self, title):
        self.title = title

    def set_default_headers(self):
        self.set_header("Content-type", "application/json; charset=utf-8")
        self.set_header("js", "zj")

    def set_default_cookie(self):
        self.set_cookie("loginuser", "admin")
        print(self.get_cookie("loginuser"))

    def get(self):
        # requires ?username=... in the query string
        username = self.get_query_argument('username')
        self.set_default_headers()
        self.set_default_cookie()
        self.write({'Jian Shu': 'Zhi ji'})

    def post(self):
        self.set_default_headers()
        # username = self.get_argument('username')

        # app.json must exist next to this script
        with open('app.json', 'r') as f:
            data = json.load(f)
            self.write(data)


if __name__ == '__main__':
    # Create an application object
    # app = tornado.web.Application([(r'/', IndexHandler)])
    # Bind a listener port
    # app.listen(9000)
    # Start the web application and start listening for port connections
    # tornado.ioloop.IOLoop.current().start()

    app = Application(
        [(r'/', IndexHandler),
         (r'/regsit', RegistHandler, {'title': 'Membership Registration'})], debug=True)
    tornado.options.parse_command_line()
    http_server = HTTPServer(app)
    print(IOLoop.current())
    # The original way
    http_server.bind(tornado.options.options.port)
    http_server.start(1)

    # Enable the IOLoop round-robin listener
    IOLoop.current().start()

Session maintenance

Calling methods such as get() or post() directly in requests can simulate network requests, but each call is effectively a different session.

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)

A Session object lets you simulate the same session without having to handle cookies yourself.

### Use of the XPath parsing library

| Expression | Description |
| --- | --- |
| nodename | Selects all child nodes of this node |
| / | Selects direct child nodes from the current node |
| // | Selects all descendant nodes from the current node |
| . | Selects the current node |
| .. | Selects the parent of the current node |
| @ | Selects attributes |
from lxml import etree
html = etree.parse('/Users/Cathy/Desktop/test.html', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)

The content of test.html is as follows:

<!DOCTYPE html>
<html lang="en">
<body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
</body>
</html>

Select the a node whose href is link4.html, get its parent node, and then get that parent's class attribute:

from lxml import etree
html = etree.parse('/Users/Cathy/Desktop/test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)

Attribute matching uses the @ symbol for filtering. Here we select the li nodes whose class is item-0:

from lxml import etree
html = etree.parse('/Users/Cathy/Desktop/test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]')
print(result)

The match returns two elements:

[<Element li at 0x109b66dc8>, <Element li at 0x109b66e08>]

Text retrieval

Use the text() method to get the text inside a node. Here the text sits inside the a node, so calling text() on li directly would not retrieve it.

from lxml import etree
html = etree.parse('/Users/Cathy/Desktop/test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)

### PyQuery

# Sample HTML, reconstructed to match the #container .list li selector used below
# (modeled on the test.html shown earlier)
html = '''
<div id="container">
    <ul class="list">
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''

Initialize a PyQuery object and pass in a CSS selector. #container. List Li means to select all li nodes inside the node whose ID is container and whose class is list.

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('#container .list li')
print(li)

Nodes can also be found using the find() method, which selects all nodes that match the criteria.

from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
print(items.find('li'))

Methods such as children() and parent() are also provided to obtain the children and parents of a node.

After extracting a node, the next step is to get the information it contains, such as its attributes and text.
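
A brief sketch of these node operations, using the html snippet assumed above:

from pyquery import PyQuery as pq

doc = pq(html)                 # html is the snippet defined above
items = doc('.list')
print(items.children('li'))    # direct li children of the ul node
a = doc('#container .list li a')
print(a.attr('href'))          # attribute of the first matched a node
print(a.text())                # combined text of all matched a nodes
print(a.parent())              # the parent li nodes of the matched a nodes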

### PyMySQL

When connecting to MySQL, you also need to specify the target database with the db parameter.

import pymysql

# building table
db = pymysql.connect(host='localhost', user='root', password='123456', port=3306, db='spiders')
cursor = db.cursor()
sql = 'CREATE TABLE IF NOT EXISTS students (id VARCHAR(255) NOT NULL, name VARCHAR(255) NOT NULL, age INT NOT NULL, PRIMARY KEY (id))'
cursor.execute(sql)
db.close()


Insert data into a table

# insert data (assumes the db connection and cursor from above are still open)
id = '201200003'
user = 'Wang'
age = 11

sql = 'INSERT INTO students(id, name, age) values(%s, %s, %s)'
try:
    cursor.execute(sql, (id, user, age))
    db.commit()
    print('ok')
except:
    db.rollback() 
    print('error')
db.close()


rollback() rolls the data back if an exception occurs.

# query data

sql = 'SELECT * FROM students'
try:
    cursor.execute(sql)
    print('count:', cursor.rowcount)  # get the number of query results
    one = cursor.fetchone()  # the offset pointer then points to the next row of data
    print('one:', one)
    results = cursor.fetchall()
    print('results:', results)
except:
    print('error')

Ajax

Ajax stands for Asynchronous JavaScript and XML. Ajax is not a programming language but a technique that uses JavaScript to exchange data with the server and update parts of a web page without refreshing the page or changing the page link.

In the developer tools, check the Headers of the requests to m.weibo.cn whose names start with getIndex. If a request carries the header X-Requested-With: XMLHttpRequest, it is an Ajax request.

The response body returns all of the rendering data for the page; JavaScript receives this data and then executes the rendering. The data we see displayed on the page is not returned with the original HTML but is obtained by JavaScript sending further Ajax requests to the backend and rendering the results. When inspecting network requests in Chrome, the XHR filter screens out the Ajax requests.

Ajax result extraction

from urllib.parse import urlencode
from pyquery import PyQuery
import requests

base_url = 'https://m.weibo.cn/api/container/getIndex?'
headers = {
    'Host': 'm.weibo.cn',
    'Referer': 'https://m.weibo.cn/u/2145291155',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

def get_page(page):
    params = {
        'type': 'uid',
        'value': '2145291155',
        'containerid': '1076032145291155',
        'page': page
    }

    url = base_url + urlencode(params)
    try:
        response = requests.get(url=url,headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)

def parse_page(json):
    if json:
        items = json.get('data').get('cards')
        print(items)
        for item in items:
            item = item.get('mblog')
            weibo = {}
            weibo['id'] = item.get('id')
            weibo['text'] = PyQuery(item.get('text')).text()
            weibo['attitudes'] = item.get('attitudes_count')
            weibo['comments'] = item.get('comments_count')
            weibo['reposts'] = item.get('reposts_count')

            yield weibo


if __name__ == '__main__':
    json = get_page(1)  # 1 is the page number to fetch
    results = parse_page(json)
    for result in results:
        print(result)


Fetching dynamically rendered pages

Selenium is an automated testing tool that can drive the browser to perform specific actions, such as clicking and scrolling, and it can also obtain the source code of the page the browser is currently rendering: what is visible is crawlable.

The Selenium/ChromeDriver combination used here only works when the Chrome version is between 70 and 73, so we'll skip it here.

#### ii. Use of Splash

##### Install Splash

Splash is a JavaScript rendering service: a lightweight browser with an HTTP API that integrates with the Twisted and Qt libraries in Python.

Running Splash requires Docker.

Pull the image

sudo docker pull scrapinghub/splash

Start the container and listen on port 8050

docker run -p 8050:8050 scrapinghub/splash


Go to http://localhost:8050/ and you can see Splash’s Web page.

The default address in the Splash page is http://google.com. Click "Render", and you can see the result for that page: a rendered screenshot, HAR loading statistics, and the page source.

We can see a script:

function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end

This script is written in Lua. The main function of this code is to call the go() method to load the page and return the source code, screenshots and HAR information of the page.

In the code below, a JavaScript snippet is passed in via the evaljs() method; the result of executing document.title, i.e. the page title, is returned.

function main(splash, args)
  splash:go("http://www.baidu.com")
  splash:wait(0.5)
  local title = splash:evaljs("document.title")
  return {title=title}
end

#### Asynchronous processing

Splash supports asynchronous processing:

function main(splash, args)
  local example_urls = {"www.taobao.com", "www.zhihu.com"}
  local urls = args.urls or example_urls
  local results = {}
  for index, url in ipairs(urls) do
    local ok, reason = splash:go("http://" .. url)
    if ok then
      splash:wait(2)
      results[url] = splash:png()
    end
  end
  return results
end

The result is a screenshot of both sites

Object properties of Splash

The second parameter args of the main function is equivalent to the splash.args property; the two snippets below are equivalent.

function main(splash, args)
    local url = args.url
end
function main(splash)
    local url = splash.args.url
end

##### js_enabled

This property is Splash's JavaScript switch; set it to true or false to control whether JavaScript code is executed.

##### resource_timeout

This property sets the page load timeout in seconds. Setting it to 0 or nil (similar to None in Python) means no timeout is enforced. An example:

function main(splash)
    splash.resource_timeout = 0.1
    assert(splash:go('https://www.taobao.com'))
    return splash:png()
end

##### go()

The go() method is used to request a link. It can simulate GET and POST requests and supports passing headers, form data, and other data.

##### jsfunc()

This method can directly call JavaScript-defined functions. The JavaScript code is wrapped in double square brackets, which is equivalent to converting a JavaScript function into one that can be called from the Lua script.

##### evaljs()

This method can execute JavaScript code and returns the result of the last statement.

##### runjs()

This method can also execute JavaScript code. It is similar to evaljs() but is better suited to performing actions or declaring functions, whereas evaljs() is more about getting the result of the execution.

function main(splash, args)
  splash:go("https://www.baidu.com")
  splash:runjs("foo = function() { return 'bar' }")
  local result = splash:evaljs("foo()")
  return result
end
##### autoload()

You can set the objects that are automatically loaded on each page visit

splash:autoload{source_or_url, source=nil, url=nil}

Parameter description:

  1. source_or_url: JavaScript code or a JavaScript library URL
  2. source: JavaScript code
  3. url: a JavaScript library URL

This method only loads the JavaScript code or library and does not execute anything. To run something, call the evaljs() or runjs() methods.

function main(splash, args)
  splash:autoload([[
    function get_document_title() {
      return document.title;
    }
  ]])
  splash:go("https://www.baidu.com")
  return splash:evaljs("get_document_title()")
end

#### Calling the Splash API

The Splash HTTP API allows a Python program and Splash to be used together to capture JavaScript-rendered pages. Splash provides several HTTP endpoints; you only need to request these endpoints with the corresponding parameters to obtain the result after page rendering.

curl http://localhost:8050/render.html?url=https://www.baidu.com

This is implemented in Python as follows:

import requests
url = 'http://localhost:8050/render.html?url=https://www.baidu.com'
response = requests.get(url)
print(response.text)
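
Besides render.html, Splash also exposes a render.png endpoint that returns a screenshot of the rendered page. A hedged sketch (the wait, width, and height values are arbitrary choices):

import requests

url = 'http://localhost:8050/render.png?url=https://www.baidu.com&wait=5&width=1000&height=700'
response = requests.get(url)
with open('baidu.png', 'wb') as f:
    f.write(response.content)  # save the returned screenshot bytes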