Preface

This article is long, but it is packed with information, resources, and ways to learn. Its main purpose is to enable readers to write basic web crawlers on their own and to pass on knowledge that genuinely helps, in the spirit of "teach a person to fish rather than give them a fish". Let's get started. This article covers the following:

  • Python environment setup and basics
  • Crawler principles
  • An overview of crawler technology
  • Scraping the Maoyan movie ranking data
  • Crawling Maoyan box-office data loaded via Ajax
  • Going further: proxies, simulated login, app crawling, and more

Python environment setup and basics

Python Environment setup

Anaconda installation

This article does not cover installing plain Python. Some readers may wonder how they can learn Python without installing it first. Don't worry: here I introduce a quicker way, installing Anaconda directly. Anaconda is a Python distribution focused on data analysis; it bundles Python, the conda package manager, and more than 720 scientific packages and their dependencies, covering data visualization, machine learning, deep learning, and other areas, and it is widely used for enterprise-level big data analysis and artificial intelligence work. Readers may wonder what this has to do with crawlers. Quite a lot: writing a crawler requires Python libraries, and Anaconda already ships with the ones you will use most often. All of this is free, so let's install the wonderful Anaconda. First, download the appropriate version of Anaconda from the official Anaconda website. If the download is too slow, the open-source mirror of Tsinghua University in China hosts the same installers. The download page of Anaconda's official website looks like this:





This article recommends the Python 3.6 version. For example, on 64-bit Windows the installer to download is Anaconda3-5.2.0-Windows-x86_64. Run the installer as shown in the following figures:



Click Next

Click I Agree to accept the license agreement

Select Just Me and click Next

Select the installation directory and click Next

Check Add Anaconda to my PATH environment variable and click Install
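Once the installer finishes, a quick way to confirm that Anaconda's Python and its bundled libraries are usable is to run a few lines in the Python it installed. This is a minimal sketch; the exact version strings depend on the Anaconda release you downloaded:

# Minimal post-install check, run inside Anaconda's Python (e.g. the Anaconda Prompt or Spyder)
import sys
print(sys.version)          # should show the Python 3.6.x build shipped with Anaconda

# requests is one of the libraries bundled with Anaconda, so this import should work out of the box
import requests
print(requests.__version__)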

IDE environment setup

For an IDE I recommend PyCharm; the free Community edition is enough for our needs. For usage tutorials, you can look up PyCharm on the CSDN blog or search for PyCharm on CSDN, where it is described in detail.

1.1 Basic Python knowledge

I'm not going to cover the basics here, because there are plenty of free tutorials on the Internet. Instead, I'll point you to free resources and methods for quickly picking up the basic and more advanced Python you need to write a crawler. For a basic crawler, we need the following Python knowledge:

  • Data types
  • Lists
  • Loops
  • Conditional statements
  • Functions

Python basics

For readers with no Python background at all, here are some quick-start resources:

  • Python 3 tutorial | rookie tutorial
  • Python tutorial | Liao Xuefeng’s official website
  • Python tutorial | introductory tutorial
  • Python tutorial™
  • Learn Python the Hard Way | Kanyun
  • The Python Tutorial | Python 3.6.6 documentation
The official Python documentation is absolutely authoritative and comprehensive, but it is written in English, which is not very friendly to most beginners; the next entry is the version translated into Chinese.
  • Python Getting Started Guide | Python 3.6.3 documentation
Some recommended introductory books:
  • Python Programming: From Novice to Practice (Python Crash Course)
For those who want to take their skills further, the following books and materials should be right up your street:
  • The Python Standard Library
  • Python Cookbook (3rd Edition)
  • Fluent Python
  • Learning Python (4th Edition)
  • Core Python Programming (3rd Edition)
  • Data Structures (Python Language Description)
  • High Performance Python

What is a crawler

Crawler principles

What is a crawler? A crawler essentially emulates an HTTP request, and that is what we will be doing over and over later, so keep it in mind. Ordinary users obtain network data in one of two ways:

a. The browser submits an HTTP request -> downloads the web page code -> renders it into a page.
b. A program simulates the browser to send a request (getting the web page code) -> extracts the useful data -> stores it in a database or file.

A crawler does the second. The general workflow is as follows:

I. Send a Request to the target site through an HTTP library; the request can carry additional headers. Then wait for the server's response.
II. If the server responds normally, we get a Response whose content is the page we want to obtain; it may be HTML, JSON, or a binary file (such as an image or video).
III. Parse the content: HTML can be parsed with regular expressions or web page parsing libraries; JSON can be converted directly into a JSON object; binary data can be saved or processed further.
IV. Save the result in whatever form suits you: as plain text, in a database, or in a file of a specific format.

Many readers may not yet know what these steps look like in practice, so next we will use the browser's developer tools to capture and analyze this process. I recommend Chrome, which is very developer friendly and which we will use frequently later. If the Chrome download is slow, use a domestic Chrome mirror. First, open the browser and type https://www.baidu.com/ in the address bar (readers can also test with other sites, such as https://gitbook.cn/), press Enter, and the Baidu page is displayed. Then press F12, the shortcut for the browser's developer tools, select "Network", and the interface shown below opens:

Press F5 to refresh the page:

The Network panel fills with a large number of entries; each one is a request the browser made for data, and the data we want is inside these requests:

  • Column 1, Name: the name of the request, usually the last part of the URL.
  • Column 2, Status: the status code of the response; 200 means the response was normal. The status code lets us judge whether the request received a normal response.
  • Column 3, Type: the type of the requested document. Here it says document, meaning what we requested is an HTML document whose content is HTML code.
  • Column 4, Initiator: the request source, marking which object or process initiated the request.
  • Column 5, Size: the size of the resource downloaded from the server. If the resource was fetched from the cache, this column shows "from cache".
  • Column 6, Time: the total time from initiating the request to receiving the response.
  • Column 7, Waterfall: a visual waterfall view of the network requests.

Next, let’s analyze the detailed composition of a request. Click the first request, the one whose Name is www.baidu.com, as shown in the figure below:

Here we can see the General section, the Response Headers, and the Request Headers.

General generally contains the following parts:

  • Request URL: the URL of the request
  • Request Method: the request method
  • Status Code: the response status code
  • Remote Address: the address and port of the remote server
Response Headers (server -> client) generally contain the following fields:

  • HTTP/1.1 200 OK: the protocol and version used for the response, the status code, and its description
  • Location: the page path the server asks the client to visit
  • Server: the name of the web server software on the server
  • Content-Encoding: the encoding (e.g., compression) applied to the data the server sends
  • Content-Length: the length of the (compressed) data sent by the server
  • Content-Language: the language of the content sent by the server
  • Content-Type: the media type and character encoding of the content sent by the server
  • Last-Modified: the time the server last modified the resource
  • Refresh: asks the client to refresh after the given interval and then visit the specified page path
  • Content-Disposition: asks the client to open the content as a download
  • Transfer-Encoding: the data is transferred to the client in chunks
  • Set-Cookie: temporary data the server asks the client to store
  • Connection: whether to keep the connection between client and server alive
Request Headers (client -> server) generally contain the following fields:

  • GET /newcoder/hello.html HTTP/1.1: the request method, the target resource, and the protocol version
  • Accept: the content types the client can receive
  • Accept-Language: the languages the client can receive
  • Connection: whether to keep the connection between client and server alive
  • Host: localhost: the destination host and port
  • Referer: tells the server which page the request came from
  • User-Agent: identifies the client browser, operating system, and their versions
  • Accept-Encoding: the compressed data encodings the client can receive
  • If-Modified-Since: the time of the client's cached copy
  • Cookie: information about this server that the client has stored and sends back
  • Date: the time the client sent the request

What we need to do is simulate the browser submitting these Request Headers, obtain the server's response, and extract the data we want from it. Readers who want to understand HTTP in more depth can visit the HTTP | MDN documentation for more information.
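As a small preview of the Requests library covered in the next chapter, here is a minimal sketch of what "simulating the browser's request headers" looks like in code: we attach a User-Agent request header and then print both the headers we sent and the headers the server returned (the exact header values you see will differ):

import requests

# Pretend to be a regular Chrome browser by supplying a User-Agent request header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
response = requests.get('https://www.baidu.com/', headers=headers)

print(response.status_code)        # the Status Code seen in the General section
print(response.request.headers)    # the Request Headers that were actually sent
print(response.headers)            # the Response Headers returned by the server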

What kind of data can a crawler capture

We can see all kinds of information on a web page. The most common is the page the user actually sees, but when we capture the browser's requests with the developer tools we see many requests whose responses are not HTML at all: they may be JSON strings, binary data such as images, audio, and video, or CSS, JavaScript, and so on. A crawler can therefore get any data the browser can get, because what the browser shows is just that data rendered for the user. In other words, if you can reach the information in a browser, a crawler can capture it.

An overview of crawler technology

^_^ This section introduces the techniques a crawler commonly uses, so that every technique involved in the later projects is already familiar to the reader: sending requests with Requests; extracting information with XPath, regular expressions (re), and json; storing data with CSV, MySQL, and MongoDB; and simulating a browser with Selenium. Here we explain how to use each of them.

First request

The Requests library

The Requests library describes itself in its documentation as "the only Non-GMO HTTP library for Python, safe for human consumption," and warns that unprofessional use of other HTTP libraries can lead to dangerous side effects, including security flaws, redundant code, reinventing the wheel, endlessly gnawing at documentation, depression, headaches, and even death. Requests was developed around the aphorisms of PEP 20, the famous Zen of Python. Here is part of that development philosophy, worth savoring, to help you write more Pythonic code.

Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated.
In 2.1 we saw that the principle of a crawler is to make an HTTP request, get the response, extract the information we want from it, and save it. The Requests library is a great way to emulate HTTP requests in Python. If you installed Anaconda, Requests is already available; if not, install it from the command line (Win+R, type cmd) with pip install requests. Now for our first request! Sending an HTTP request with Requests is very simple; let's take GitChat as an example:

# Import the requests module
import requests
# Make a GET request; it returns a Response object containing the server's response to the HTTP request
response = requests.get('https://gitbook.cn/')
# Print the response status code
print(response.status_code)
# Print the response body as str (e.g., a normal HTML page); use .text for further text analysis
print(response.text)

The partial results are shown below:

Requests supports more than GET requests; for example, here is a POST request:

# Import the requests module
import requests
# Form data to submit
data = {'name': 'ruo', 'age': 22}
# Make a POST request
response = requests.post("http://httpbin.org/post", data=data)
# Response body content
print(response.text)

The partial results are shown below:



Of course, Requests supports more request types than GET and POST; I will not demonstrate each of them here:

# PUT request
requests.put("http://httpbin.org/put")
# DELETE request
requests.delete("http://httpbin.org/delete")
# HEAD request
requests.head("http://httpbin.org/get")
# OPTIONS request
requests.options("http://httpbin.org/get")
Since most servers can identify the client's operating system, browser, and their versions through the User-Agent field in the request headers, a crawler also needs to add this information to disguise itself as a browser; if it is missing, the request may be identified as coming from a crawler and rejected. For example, here is a GET request to Zhihu without any headers:

# Import the requests module
import requests
# Make a GET request
response = requests.get("https://www.zhihu.com")
# Status code
print(response.status_code)
# Response body content
print(response.text)

What is returned is as follows:



Now add a headers parameter to the request with User-Agent information and try again:

# Import the requests module
import requests
# Add a User-Agent to the headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
# Make a GET request
response = requests.get("https://www.zhihu.com", headers=headers)
# Status code
print(response.status_code)
# Response body content
print(response.text)

What is returned is as follows:



You can see that the request succeeded and that the correct response status code and body are returned. Readers who want to learn more about Requests can visit the Requests official documentation or the Chinese documentation.
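One more detail worth knowing before moving on: when the response body is JSON rather than HTML (as with many Ajax interfaces, including the Maoyan box-office data crawled later), Requests can parse it directly. A minimal sketch using httpbin.org, which simply echoes the request back as JSON:

import requests

# httpbin.org/get echoes the request back as a JSON document
response = requests.get('http://httpbin.org/get', params={'name': 'ruo'})
# .json() parses the JSON body into a Python dict, so no manual parsing is needed
data = response.json()
print(data['args'])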

Extracting information

Once we get a response to our HTTP request, we need to extract the content we care about from the response body. Here I introduce two commonly used extraction methods: regular expressions and XPath.

Regular expression

Regular expressions are a very powerful string-processing tool; almost any operation on a string can be done with them. For a crawler, which deals with strings every day, regular expressions are an indispensable skill, and with them it is very convenient to extract the information you want from HTML. Readers can get started quickly with Regular expressions | Liao Xuefeng's official website, learn Python's regex operations through Python regular expressions | rookie tutorial, and find a detailed introduction to the re module in section 6.2 of The Python Standard Library in the official Python documentation. Regular expressions are, in effect, a small, highly specialized programming language. Here is one general-purpose rule for extracting information: .*? matches any characters in a non-greedy way, and we will use it a lot later. For example, if we need to match the text inside an <h1> tag, we can write:

# Import the re module
import re
# Text to be matched
h1 = '<h1>Chapter 3.2.1 - Regular Expressions</h1>'
# Compile the regular string into a regular expression object so it can be reused in later matches
pat = re.compile('<h1>(.*?)</h1>', re.S)
# re.search scans the entire string and returns the first successful match
result = re.search(pat, h1)
# group(0) returns the entire match
print(result.group(0))
# group(1) returns the content of the first capturing group; passing several group numbers returns a tuple of the corresponding values
print(result.group(1))
Here are the matching results:



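re.search only returns the first match. When a page contains many items with the same structure, re.findall returns them all at once; here is a small sketch with a made-up HTML fragment:

import re

# A hypothetical fragment with several list items
html = '<li>Farewell My Concubine</li><li>The Shawshank Redemption</li><li>Roman Holiday</li>'
# findall returns every non-overlapping match of the capturing group as a list
names = re.findall('<li>(.*?)</li>', html, re.S)
print(names)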
XPath

XPath, the XML Path Language, is a language for locating parts of an XML document. It is based on the XML tree structure and provides the ability to find nodes in that tree. XPath was originally proposed as a general syntactic model between XPointer and XSL, but it was quickly adopted by developers as a small query language, and it is also an excellent tool for extracting information in a crawler. Readers can learn the principles and syntax through the XPath tutorial | rookie tutorial, or search for Python XPath on the CSDN blog to learn more basic XPath operations in Python. Next I introduce a trick for "writing" XPath rules (the quotes will make sense in a moment). Remember the browser developer tools from section 2.1? We can use them to obtain the XPath rule of a node directly and thereby quickly extract information from a page, for example the movie names in the Maoyan Top 100 list. First, open http://maoyan.com/board/4 in the browser, move the mouse over the information to extract (the movie name), right-click, and choose Inspect, as shown in the figure below:



Next, select the highlighted element, right-click, and choose Copy -> Copy XPath, as shown in the following image:



Now that we have the xpath rule for this node, we’ll write Python to verify that the rule can actually extract the movie name:

import requests
# Import the etree module of the lxml library
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
url = 'http://maoyan.com/board/4'
response = requests.get(url, headers=headers)
html = response.text
# Call the HTML class to initialize the parse tree
html = etree.HTML(html)
# Paste the copied XPath to extract the movie title "Farewell My Concubine"
result_bawangbieji = html.xpath('//*[@id="app"]/div/div/div[1]/dl/dd[1]/div/div/div[1]/p[1]/a')
# Print the text content contained in the node tag
print(result_bawangbieji[0].text)
# Extract all movie names on this page, i.e. select the name anchors of every 'dd' tag
result_all = html.xpath('//*[@id="app"]/div/div/div[1]/dl/dd/div/div/div[1]/p[1]/a')
# Print all extracted movie names
print('All movie names on this page:')
for one in result_all:
    print(one.text)

As shown below, we successfully extracted the movie name information from the HTML:



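A copied absolute XPath like the one above breaks easily when the page layout changes. A more robust habit is to anchor the rule on class names and use text() to select the text directly. The sketch below assumes the Maoyan list marks each title with a p tag of class "name"; verify the class name in the developer tools before relying on it:

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
response = requests.get('http://maoyan.com/board/4', headers=headers)
html = etree.HTML(response.text)
# Assumed structure: each <dd> holds a <p class="name"><a>movie name</a></p>
names = html.xpath('//dd//p[@class="name"]/a/text()')
print(names)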
Store information

TEXT storage

Readers who have learned the Python basics should be familiar with the simplest way of storing information: writing it directly to a file, such as a plain text file. Those who are not can take a quick look at Python file reading and writing | Python Tutorial™. Let's store the movie names extracted with XPath in 3.2.2 into a text file:

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
url = 'http://maoyan.com/board/4'
response = requests.get(url, headers=headers)
html = response.text
# Call the HTML class to initialize the parse tree
html = etree.HTML(html)
# Paste the copied XPath to extract the movie title "Farewell My Concubine"
result_bawangbieji = html.xpath('//*[@id="app"]/div/div/div[1]/dl/dd[1]/div/div/div[1]/p[1]/a')
# Print the text content contained in the node tag
print(result_bawangbieji[0].text)
# Extract all movie names on this page
result_all = html.xpath('//*[@id="app"]/div/div/div[1]/dl/dd/div/div/div[1]/p[1]/a')
# Print all extracted movie names
print('All movie names on this page:')
for one in result_all:
    print(one.text)
# Store the movie names of this page in a text file. 'a' opens the file for appending:
# if the file exists, the file pointer starts at the end (append mode); if not, a new file is created.
with open('film_name.text', 'a') as f:
    for one in result_all:
        f.write(one.text + '\n')

The storage results are as follows:




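To confirm what was written, the file can simply be read back (a minimal check):

# Read the text file back and print its contents
with open('film_name.text', 'r') as f:
    print(f.read())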

CSV storage

A CSV file stores comma-separated values (sometimes called character-separated values, because the delimiter need not be a comma). CSV is a common plain-text format for storing tabular data, both numbers and text. Python has a built-in csv module; you only need to import it to read and write CSV files. Below we store the movie names extracted with XPath in 3.2.2 into a CSV file:

import requests
from lxml import etree
# Import the csv module
import csv

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
url = 'http://maoyan.com/board/4'

response = requests.get(url, headers=headers)
html = response.text
html = etree.HTML(html)
result_bawangbieji = html.xpath('//*[@id="app"]/div/div/div[1]/dl/dd[1]/div/div/div[1]/p[1]/a')
print(result_bawangbieji[0].text)
result_all = html.xpath('//*[@id="app"]/div/div/div[1]/dl/dd/div/div/div[1]/p[1]/a')
print('All movie names on this page:')
for one in result_all:
    print(one.text)
# Store the movie names in a CSV file; newline='' prevents blank lines between rows on Windows
with open('film_name.csv', 'a', newline='') as f:
    csv_file = csv.writer(f)
    for one in result_all:
        csv_file.writerow([one.text])

CSV file storage results are shown as follows:


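The csv module can also read the file back row by row with csv.reader, which is a handy way to double-check the result (a minimal sketch):

import csv

# Read the CSV file back; each row comes out as a list of column values
with open('film_name.csv', 'r', newline='') as f:
    for row in csv.reader(f):
        print(row)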

MySQL storage

MySQL is the most popular relational database management system. Readers who do not have MySQL installed can download phpStudy 2018, which bundles MySQL and installs it quickly. In Python 2, most code connected to MySQL through the MySQLdb library, but that library does not officially support Python 3, so the recommended library here is PyMySQL. Readers can learn PyMySQL's methods and see examples through Python MySQL database operations (PyMySQL) | Python Tutorial™. Next we store the movie names extracted with XPath in 3.2.2 into MySQL. Readers who do not have the module can install it with pip install pymysql (Win+R, type cmd).
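The script below assumes that a database named spider and a table named film_infor with a film_name column already exist. If they do not, here is a minimal sketch to create them (assuming a local MySQL reachable with user root and password root, the same connection settings used below):

import pymysql

# Connect without selecting a database so we can create it first
db = pymysql.connect(host='localhost', user='root', password='root', port=3306, charset='utf8')
cursor = db.cursor()
cursor.execute("CREATE DATABASE IF NOT EXISTS spider DEFAULT CHARACTER SET utf8")
cursor.execute("USE spider")
cursor.execute("CREATE TABLE IF NOT EXISTS film_infor (film_name VARCHAR(255) NOT NULL)")
db.close()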

import requests
from lxml import etree
# Import the pymysql module
import pymysql

# Open a database connection
db = pymysql.connect(host='localhost', user='root', password='root', port=3306, db='spider', use_unicode=True, charset="utf8")
# Use cursor() to obtain a cursor object for executing SQL statements
cursor = db.cursor()

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
url = 'http://maoyan.com/board/4'

response = requests.get(url, headers=headers)
html = response.text
html = etree.HTML(html)
result_bawangbieji = html.xpath('//*[@id="app"]/div/div/div[1]/dl/dd[1]/div/div/div[1]/p[1]/a')
print(result_bawangbieji[0].text)
result_all = html.xpath('//*[@id="app"]/div/div/div[1]/dl/dd/div/div/div[1]/p[1]/a')
print('All movie names on this page:')
for one in result_all:
    print(one.text)

    try:
        # Insert data statement
        sql = 'INSERT INTO film_infor(film_name) values (%s)'
        cursor.execute(sql, (one.text,))
        db.commit()

    except:
        db.rollback()

The MySQL storage result is as follows:
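In addition to checking the result in a MySQL client, the table can be queried back with the same connection settings to verify the inserts (a minimal sketch):

import pymysql

db = pymysql.connect(host='localhost', user='root', password='root', port=3306, db='spider', charset='utf8')
cursor = db.cursor()
cursor.execute('SELECT film_name FROM film_infor')
for (film_name,) in cursor.fetchall():
    print(film_name)
db.close()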