Hello, I’m Yue Chuang.

One, Python installation

1.1 Python installation

Download method

Visit the official website: www.python.org

As shown in figure:

  1. Select the Downloads option at the top
  2. Select your system in the pop-up option box (note: if you click the gray button on the right directly, you will download the 32-bit version)

Enter the download page, as shown in the figure:

  1. Download for 64-bit files
  2. Download for 32-bit files

Select and download the file that matches your system.

Installation Precautions

(Photo from Internet)

Custom installation: lets you choose where to store the files and which components to install, which makes Python easier to manage.

Default installation: click Next all the way to the end; installation is more convenient and faster.

Special note: make sure to check the box indicated by the arrow in the picture (Add Python to PATH). Otherwise, you will have to configure the environment variables manually.

Q: How do you configure environment variables?

A: Control Panel – System and Security – System – Advanced System Settings – Environment Variables – System Variables – double-click Path – in the Edit Environment Variables window, add the path of the folder where Python is installed – OK.

Check the installation

After installing Python, press Win+R to open the Run window, type cmd to enter command-line mode, and then type python. If the Python version number and the >>> prompt appear as shown in the following figure, Python is installed successfully.

1.2 The Python editor Sublime

Website: www.sublimetext.com/

Reasons for choosing this editor:

  1. No deep programming background needed; quick to get started
  2. Fast startup speed
  3. The key reason: it is free

Q&A

If you cannot run code with the shortcut Ctrl+B, try Ctrl+Shift+P and select Build With: Python in the window that appears.

Or select the Build With option under the Tools menu at the top, and select Python in the window that pops up.

1.3 Crawler common library installation

Python libraries can be installed automatically with pip install <library name>.

Requests

The code is as follows:

pip install requests
BeautifulSoup

The code is as follows:

pip install lxml
pip install BeautifulSoup4

Scrapy

Method 1

  • Windows

The code is as follows:

pip install scrapy
  • Mac system

The code is as follows:

xcode-select --install

pip3 install scrapy

Method 2

The first method often produces errors, so we have to install the dependency libraries first.

1. Install the dependency library LXML as follows:

pip install lxml

2. Install the dependent library pyOpenSSL

  • Go to pypi.org/project/pyO… and download the wheel file
  • In the command-line window, use cd to change into the download directory

The code is as follows:

pip install pyOpenSSL-18.0.0-py2.py3-none-any.whl

3. Install the Twisted library

Go to www.lfd.uci.edu/~gohlke/pyt… to download the corresponding version.

The code is as follows:

pip install <the wheel file you just downloaded>

4. Install the pyWin32 dependency library

Go to the website to download the corresponding version:

Sourceforge.net/projects/py…

After downloading, double-click the .exe file to install.

5. Install Scrapy

The code is as follows:

pip install scrapy

Two, quick introduction to Python

Preface: although this section is friendly to complete beginners, it also expands on some deeper knowledge points, so readers with a programming background can skim it selectively!

2.1 Python Interactive mode and Command line Mode

Command line mode

1. Access mode

Windows:

  1. Click Start, choose Run, type cmd and press Enter
  2. Press Win+R, type cmd and press Enter

Mac:

  1. Open Launchpad, find and open the Other folder, and click Terminal.
  2. Open a Finder window and search for the keyword Terminal directly in the Applications directory

2. Prompt

The command prompt varies with the operating system. The following uses Windows as an example:

C:\machine name\username>

Interactive mode

1. Access mode

Enter it by typing python in command-line mode; type exit() to leave interactive mode.

2. Prompt: >>>

The difference between the two
  1. A .py file can only be run from the command line;
  2. In Python interactive mode, code is executed line by line as you type it; running a .py file from the command line executes all the code in the file at once.

From this point of view, Python interaction mode is primarily used to debug code.

2.2 Data types and variables

The main data types in Python are: int, float, bool, string, list, tuple, dict, and set.

  2                     # integer (int)
  3.1314526             # float
  True                  # Boolean value (bool)
  "1"                   # string (str)
  [1, 2, "a"]           # list
  (1, 2, "a")           # tuple
  {"name": "Xiao Ming"} # dictionary (dict)

In Python, you can use # to write comments, which the interpreter ignores automatically when the code runs.

Integers

In line with the concept of integers in mathematics, there are four base representations: binary, octal, decimal and hexadecimal. By default, integers are in decimal.

(Photo from Internet)

Floating-point numbers

Represents a number with a decimal point. Floating-point numbers can be written in two ways: plain decimal and scientific notation. (Note: calculators and computers express powers of 10 using e or E, i.e. 2.88714e13 = 28871400000000.)
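For example, the following snippet shows that both notations produce the same float value:

a = 28871400000000.0   # plain decimal notation
b = 2.88714e13         # scientific notation: 2.88714 x 10**13
print(a == b)          # True
print(type(b))         # <class 'float'>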

Boolean value

Boolean values in Python have two values: True and False, which correspond to 1 and 0 respectively (True and False are case-sensitive).

The code is as follows:

 var1 = 12
 var2 = 12
 var3 = 13
 print(var1==var2) # output True
 print(var1==var3) # output False


In var1 == var2, the == is the equality operator: it compares whether var1 is equal to var2 and evaluates to True if they are equal, False otherwise.

In addition, Boolean values can be combined with and, or, and not.

The code is as follows:

And: strict, requires every operand to be True, otherwise the result is False.
True and True   # True
True and False  # False
False and False # False

The code is as follows:

Or: lenient, as long as one operand is True the result is True.
True or True    # True
True or False   # True
False or False  # False

The code is as follows:

Not: always the contrarian; give it True and it outputs False, and vice versa. (Special note: it is a unary operator.)
not True   # False
not False  # True

Strings

Strings are arbitrary text enclosed in single or double quotation marks, such as 'aaa' or "abc". The quotation marks themselves are only delimiters, not part of the string, so the string 'aaa' has only three characters: a, a, a.

What if the string itself contains ' or "? We can mark it with the escape character \, as in:

The string you're is represented as:

'you\'re'

What if the string also contains the character \? Then we can write it with \\, as in:

"you\\'re"
Lists

Lists are one of the more important data containers in Python.

The code is as follows:

list1 = [1, 2, 3, 4, 5]
list2 = ["AI Yue Chuang", "GitChat", "Fly"]

Lists are indexed, so to access values in a list, you simply need the list name + the index value.

print(list1[2])  # Output: 3
print(list2[0])  # Output: AI Yue Chuang
# Example 2
lists = ['a', 'b', 'c']
lists.append('d')
print(lists)
print(len(lists))
lists.insert(0, 'mm')
lists.pop()  # Delete the last element
print(lists)
# Output
['a', 'b', 'c', 'd']
4
['mm', 'a', 'b', 'c']
Tuples

Tuple creation is as simple as adding elements in parentheses, separated by commas.

Example code:

tup1 = ('aaa', 1, 'bbb', 2)

Note: when a tuple contains only one element, you need to add a comma after the element; otherwise the parentheses are treated as ordinary grouping parentheses rather than a tuple.

>>> tup1 = (1)
>>> type(tup1)
<class 'int'>
>>> tup2 = (1,)
>>> type(tup2)
<class 'tuple'>
The difference between lists and tuples

You may have noticed how similar lists and tuples are; the main differences are:

  1. Tuples use parentheses and lists use square brackets.
  2. Lists are dynamic, their length is not fixed, and you can add, subtract, or change elements at will (mutable).

Tuples are static and cannot be added, subtracted or changed (immutable).

In fact, mutability is the most important difference between lists and tuples, and this difference affects how they are stored. We can look at the following example:

l = [1, 2, 3]
l.__sizeof__()
64
tup = (1, 2, 3)
tup.__sizeof__()
48

You can see that we stored the same elements in a list and a tuple, yet the tuple takes 16 bytes less storage than the list. Why is that?

In fact, since the list is dynamic, it needs to store a pointer to each element (8 bytes per pointer in the case of ints). In addition, because the list is mutable, it needs to store the allocated length (8 bytes) so that it can track how much space the list is using in real time and allocate additional space when it runs out.

Example code:

l = []
l.__sizeof__()  # the empty list takes 40 bytes of storage
40
l.append(1)
l.__sizeof__()
72   # after adding element 1, the list allocated space for 4 elements: (72 - 40) / 8 = 4
l.append(2)
l.__sizeof__()
72   # space was allocated earlier, so adding element 2 leaves the size unchanged
l.append(3)
l.__sizeof__()
72   # same as above
l.append(4)
l.__sizeof__()
72   # same as above
l.append(5)
l.__sizeof__()
104  # after adding element 5 the list ran out of space, so it allocated room for 4 more elements

The above example outlines how list space is allocated. To reduce the overhead of reallocating on every add/remove, Python over-allocates a bit more space each time; this over-allocating mechanism keeps the operations efficient, with an amortized time complexity of O(1) for adding/removing.

But for tuples, it’s a different story. The length is fixed and the elements are immutable, so the storage space is fixed.

Looking at the previous analysis, you might think that such differences are negligible. But imagine if lists and tuples stored 100 million, billion or more elements. Could you ignore the difference?

So we can conclude that tuples are lighter than lists, so overall, tuples perform slightly faster than lists.

Dictionaries

A dictionary is a container in which each element is split into a key and a value; adding, deleting, modifying and looking up values are all done through keys. Note: the keys of a dictionary must be immutable data types, such as int, float, string, and tuple.

The following code

brands = {"Tencent": "Tencent", "Baidu": "Baidu", "Alibaba": "Alibaba"}

brands["Tencent"]      # Get the value whose key is "Tencent"
del brands["Tencent"]  # Delete the "Tencent" entry
brands.values()        # Get all values
brands.get("Tencent")  # Get the value whose key is "Tencent"
Sets

A set is an unordered sequence of non-repeating elements that can be created by {} or set().

The code is as follows:

set1 = {'a', 'aa', 'aaa', 'aaaa'}        # {'aaa', 'aa', 'aaaa', 'a'}
set1 = set(['a', 'aa', 'aaa', 'aaaa'])
print(set1)  # {'aaaa', 'aa', 'a', 'aaa'}

Note: To create an empty collection, you must use set() instead of {}, because {} is used to create an empty dictionary.

>>> s={}
>>> type(s)
<class 'dict'>
Extension

When discussing data types, we talk a lot about mutable and immutable objects, but what is mutable and what is immutable?

Here’s a hint:

  • Python immutable objects: int, float, tuple, string
  • Python mutable objects: list, dict, set

A mutable object is an object whose elements can be changed; an immutable object is an object whose elements cannot be changed. The difference between the two is whether the elements can be changed.

How do we usually try to modify objects? Here we introduce the common operations: add, delete, modify, and look up.

Take the mutable list as an example. Add: append() and insert().

The code is as follows:

>>> list = ["a", "b"]
>>> list.append("c")  # append(element) adds the element to the end of the list
>>> print(list)
['a', 'b', 'c']

>>> list.insert(0, "d")  # insert(index, element) adds the element at the specified position
>>> print(list)
['d', 'a', 'b', 'c']

Delete: remove(), pop(index), pop()

Run the following code:

>>> list.remove("d")  # remove(element) removes the given element from the list
>>> list
['a', 'b', 'c']
>>> list.pop(1)  # pop(index) removes the element at the specified position
'b'  # the removed element
>>> print(list)
['a', 'c']
>>> list.pop()  # pop() removes the last element by default
'c'  # the removed element
>>> print(list)
['a']

Modify: list[index] = element

The code is as follows:

>>> list = ['a', 'c']
>>> list[0] = 'b'  # Replace the element at the specified position
>>> print(list)
['b', 'c']

Look up: list[index]

The code is as follows:

>>> list = ['b', 'c']
>>> list[1]  # Look up the element at the specified position
'c'

All the modifications to the mutable list succeeded. Now let's try a simple change to an immutable tuple and see what happens.

>>> tuple1 = (1, 2, 3, 4)
>>> tuple1[0] = 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment  # error

Through the above changes to the list and tuple, we can confirm one of the most straightforward differences between immutable objects and mutable objects: mutable objects can be modified, while immutable objects cannot be modified.

Variables

When we type 1+1 in the CMD console, the console outputs 2. But what if we want to continue to use this 2 in future calculations? We need a “variable” to store the value we need.

The code is as follows:

a = 1 + 1  # a is a variable that stores the 2 produced by 1 + 1

As shown in the example above, variable assignment in Python does not require a type declaration.

Tip: creating a variable reserves space in memory. Based on the variable's data type, the interpreter allocates the required memory and decides what kind of data can be stored there.

Extension

Isn't this surprising? After b = a, if a changes, shouldn't b change with it?

Let's use an analogy:

Suppose the developer is memory, a variable is a house, and the value stored in the variable is the resident. Before b = a, the assignment a = 1 has the developer build house a, where resident 1 lives. When b = a is executed, the developer sets aside another block of memory to build house b, and resident 1 now lives in both house a and house b. So when a = 4, house a gets a new resident, but that does not affect the resident of house b, who is still number 1.
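A quick check of this behavior in the interpreter:

a = 1
b = a        # b now stores the value 1
a = 4        # rebinding a does not affect b
print(a)     # 4
print(b)     # 1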

Three, conditions, loops, and other statements

Python uses if and else as conditional statements.

Example code:

# if... else...
i = 1
if i == 1:
    print("Yes, it is 1")
else:
    print("No, it is not 1")
# if...else... is the classic judgment statement. Note that there is a colon after the if expression and also one after else.

The above statement determines whether the variable i is equal to 1. Note that Python takes indentation seriously, so when you write statements you need to make sure code at the same level has the same indentation.

Python supports for loops and while loops. Loop statements follow the same rules as if and else statements, such as the colon and the indentation.

for i in range(1, 10):
    print(i)

The above statement prints the numbers from 1 to 9. Note that range(1, 10) goes from 1 to 9 and does not include 10.

i = 1
while (i < 10):
    i += 1
    if i != 8:
        continue
    else:
        break

In the above statement, break means break out of the loop, while continue means skip the rest of this iteration and continue with the next one.

3.1 Functions

We regularly use print() and type(), both of which are functions. For repeated code, we don't need to write it out every time; we just call it by the function's name.

The keyword for defining a function is def, defined in much the same way as for loops.

The code is as follows:

def function(param):  # function = function name, param = parameter
    i = 1
    return i  # return the value

To make this more concrete, let's write a function that sums a and b.

def getsum(a, b):  # define getsum(a, b)
    sum = a + b
    return sum  # return the sum of a + b

print(getsum(1, 2))

Once defined, we can use this function elsewhere in the program by calling getsum(a, b).

3.2 Files

Python provides rich and easy-to-use file manipulation functions, so let’s take a quick look at common operations.

open()

The code is as follows:

open("abc.txt"."r")  
#open() is Python's built-in file function for opening files, "abc.txt" for the target file name, "r" for open files in read-only mode, and other "W" and "A" modes
Copy the code
read()

The content of an opened file must be retrieved using the .read() method.

file = open("abc.txt", "r")
words = file.read()

For detailed reading, you can refer to this article: With Open ~, a collection of basic file operations in Python.
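For example, a common pattern is to use with open, which closes the file automatically when the block ends (a minimal sketch; abc.txt is just a placeholder file name):

# Read a text file; the file is closed automatically at the end of the with block
with open("abc.txt", "r", encoding="utf-8") as file:
    words = file.read()
    print(words)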

Four, web page basics

4.1 What is a Web page

When you type www.baidu.com into your browser and press Enter, everything displayed on your screen is actually a web page. Web pages are identified and accessed through URLs. Baidu Baike defines a web page as follows.

A web page (English: Web Page) is the basic element of a website and the platform that carries various web applications. In plain English, your website is made up of web pages. If you only have a domain name and virtual host but have not made any web pages, your customers still cannot visit your website.

A web page is a plain text file containing HTML tags, stored on some computer connected to the World Wide Web. It is a "page" of the World Wide Web, written in HyperText Markup Language (an application of the Standard Generalized Markup Language), with the file extension .html or .htm. Web pages often use image files to provide pictures, and are read through a web browser.

Simply put, any page you see in your browser is a web page.

4.2 Why learn web knowledge

The main reason to learn the basics of the web is that the techniques taught later in this Chat involve analyzing and crawling web content. Even if you are just starting out with crawlers, you need to know something about the web. As a developer, it's not just about knowing how, but also knowing why. Blindly copying code without understanding why you are doing it greatly reduces the effect of learning. Therefore, I have a learning method to share with you on my official account:

Mp.weixin.qq.com/s/W4yf0eoUP…

4.3 Process of browsing web pages

  • Enter url
  • The browser sends a request to the DNS service provider
  • Find the corresponding server
  • The server resolves the request
  • The server processes the request and sends back the final result
  • The browser parses the returned data
  • Show to the user

4.4 About Domain Names

Writing crawlers is inseparable from domain names, or, put simply, URLs; the first step is to analyze the patterns in them.

You need to know which part is the first-level domain and which is the second-level domain, as shown in the diagram below:

(Photo from Internet)

Which part of the URL is the URL parameter? Think about it, then read on.

(Photo from Internet)

4.5 Front-end Introduction

So, here’s the problem!

What is the front end? What is the back end?
  • Programmer A: I work on the back end
  • Programmer B: I’m on the front end

What can we learn from these two simple sentences?

In fact, you can simply think of it this way: the front end mainly builds the human-computer interface, while the back end mainly writes the server-side code. I won't discuss which side is harder, but this gives you a rough idea of what front end and back end mean.

So what’s one of the tools you use when you’re developing a web page?

I can give you an answer: one of the tools used on the front end is Chrome or Firefox. For a crawler engineer it would be embarrassing not to use one of these front-end development tools. So why these two, and why not the domestic 360 Browser, QQ Browser, or Internet Explorer?

I split this question into two answers:

  1. 360 Browser, QQ Browser and similar browsers are actually based on Google's Chromium, but they lack some of the developer tools you need to learn.
  2. As for Internet Explorer, Microsoft has indicated that it will stop updating it and urges users not to use it.
The front end has three important aspects

(Photo from Internet)

If you are interested, you can explore these three parts on your own.

4.6 HTML

HyperText Markup Language

HTML is the most basic element of a web page; it is a markup language used to organize content (text, images, video).

The one on the right is HTML:

For this part of the front end, you need to type the code yourself to get a feel for it:

Create a new file >>> save the file >>> name it aiyc.html >>> then type the following content into Sublime Text 3 to try it out

The code is as follows:

<!DOCTYPE html>
<html>
    <head>
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
        <title>AI Yue Chuang</title>
        <meta name="description" content="">
        <meta name="keywords" content="">
    </head>

    <body>
            <h1>This is a big headline</h1>
            <p>This is the paragraph</p>
    </body>
</html>

Browser opening result:

The <html>, <head>, <body> and so on in the above example are all HTML "tags": anything wrapped in <> counts as a tag. Note that a tag usually comes as a start tag and an end tag, written as <tag>content</tag>. Let's explain the tags in the above example in detail. [1]

  • <!DOCTYPE html> is the HTML declaration. Its purpose is to tell the browser exactly which HTML version is used so it can render the page content correctly. (For HTML version questions, you can refer to this article: Version history of the HTML standard.)
  • <html></html> is the root element of HTML. All the content of an HTML document must go inside this tag.
  • <head></head> holds the HTML metadata.
  • <meta> provides meta information about the HTML page, such as defining the page encoding and managing keywords for search engines.
  • <title></title> is the title of the page; when we open a page, the tab name the browser displays is the text inside title.
  • <body></body> contains all the visible content of the HTML document (for example, text, video, audio, and so on).
  • <h1></h1> is used to define headings. In HTML, the h level defines the heading size. There are 6 levels of headings, <h1> to <h6>, with text from large to small.
  • <p></p> is the paragraph tag of an HTML page. Use this element in HTML when text should start on a new line.
The head and body

4.7 HTML element parsing

Common HTML tags

Introduction to CSS

CSS stands for Cascading Style Sheets. It defines how a web page displays its elements: whether a paragraph sits on the left, right or center of the browser, and what font, color, and size the text uses are all defined by CSS (the styles used to render element tags in HTML documents).

There are three common ways to use the CSS:

  • Inline: Use the “style” attribute directly in HTML elements.
  • Internal style sheet: write CSS inside a <style> element within the <head></head> tag.
  • External references: Use externally defined CSS files.

Inline

To use CSS inline, you only need to use style attributes in the associated tags, and no additional configuration is required.

<p>This is normal paragraph text</p>
<p style="color:red">This is paragraph text using inline CSS</p>

The two paragraphs above, rendered by the browser, should look like this:

Internal style sheet

Although the inline way is simple, if there are many tags, adding styles one by one undoubtedly wastes precious time. Internal style sheets are used when you want a uniform style for an element or want to manage related styles more easily.

The code is as follows:

<head>
    <style type="text/css">
        p {color: red;}
    </style>
</head>

The internal style sheet should be defined inside the <head> section with the <style> tag.

External style sheets

Think about it: you have 100 pages using CSS. If you use inline styles, the workload is enormous. If you use an internal style sheet, you have to repeat it 100 times. At this point you need an external style sheet to save the day. An external style sheet can change the appearance of an entire website from a single file.

The code is as follows:

 <head>
   <link rel="stylesheet" type="text/css" href="gitchat.css">
 </head>

In the example above, we imported an external style sheet called gitchat.css via the <link> tag. gitchat.css is a file of CSS styles that have already been written. When a matching tag appears in our file, the browser automatically applies the style for us.

CSS parsing

  • An id can appear only once per HTML page
  • A class can be used multiple times

The box model

(Photo from Internet)

JavaScript

A programming language used primarily at the front end to provide dynamic, interactive effects for web sites.

Five, crawler basics

5.1 Explanation of basic principles of crawlers

First, let’s take a look at what the Internet is:

What is the Internet?

The Internet is made up of network devices (network cables, routers, switches, firewalls, etc.) and computers, just like a network.

What was the Internet built for?

The core value of the Internet lies in the sharing/transmission of data: data is stored in a computer, and the purpose of connecting computers together is to facilitate the sharing/transmission of data between each other, otherwise you can only take a USB disk to copy data on someone else’s computer.

What is surfing the Internet? What does a crawler do?

So-called Internet access is the process in which the client computer sends a request to the target computer and downloads the target computer's data to the local machine.

The way a user normally accesses network data is:

The browser submits the request >>> downloads the web code >>> parses/renders the page

And what a crawler does is:

Simulate a browser to send a request >>> download the web code >>> extract only useful data >>> store it in a database or file

Here’s the difference: our crawler only extracts the data that is useful to us from the web code.

5.2 Crawlers

The basic flow of crawlers

To give you a better experience, I have turned the text into a diagram, shown below:

What are Request and Response?

What does Request contain?

Request methods

  • Common request methods are GET and POST
  • Other request methods: HEAD, PUT, DELETE, and OPTIONS

GET and POST request methods have the following differences:

  • The parameters of a GET request are contained in the URL, so the data can be seen in the URL, while the URL of a POST request does not contain the data; the data is transmitted through a form and is contained in the request body.
  • A GET request URL can carry only a limited amount of data (a limit of 1024 bytes is often cited), while POST has no such limit.
  • The parameters of a GET request are placed directly after the URL, e.g. ?k1=xxx&k2=yyy&k3=zzz. The parameters of a POST request are stored in the request body and can be viewed in the browser under Form Data.

POST:

The GET:

Here we use the GitChat website (gitbook.cn/) to demonstrate the process:

Open the site >>> right-click (or press F12) to enter developer mode >>> clear the request list >>> log in by scanning the QR code (WeChat) >>> Network >>> All >>> find the POST request

Next, let’s take a look at the GIF demo:

What’s in Response?

1. The request URL

URL stands for Uniform Resource Locator; a web document, an image, a video and so on can each be uniquely identified by a URL.

The code is as follows:

# URL encoding
# https://www.baidu.com/s?wd=picture
# The keyword "picture" will be URL-encoded

The code is as follows:

# The loading process of a page:
# When loading a web page, the document file is usually loaded first.
# While parsing the document, whenever a link such as an image hyperlink is encountered, a request is made to download that image.

2. Request header

  • User-Agent: if the request header has no User-Agent client configuration, the server may treat you as an illegitimate user
  • Cookies: cookies are used to keep login information

It is always a good idea to add headers to a crawler.

3. Request body

  • If GET is used, the request body has no content
  • In POST mode, the request body is the Form Data
  • For login forms, file uploads and so on, the information is attached to the request body
  • If you enter a wrong username and password and submit, you can see the POST request; after a correct login the page usually redirects, so the POST cannot be captured

5.3 Crawler summary

An analogy for a crawler:

If we compare the Internet to a large spider web, the data on a computer is prey on that web, and a crawler is a little spider that crawls along the web to grab the prey/data it wants.

Definition of crawler:

A program that sends requests to a website, obtains the resources, and then analyzes them and extracts useful data.

The value of a crawler:

The most valuable thing on the Internet is data, such as product information on Tmall, rental listings on Lianjia and securities investment information on Xuebao. This data represents real money in each industry; it can be said that whoever holds the first-hand data of an industry becomes the master of that industry.

If the data of the entire Internet is a treasure, then our crawler course teaches you how to mine it efficiently. Master crawler skills and you become the boss behind every Internet information company; in other words, they are providing you with valuable data for free.

Six, basic operations of the Requests and BeautifulSoup libraries

Have you had these questions before?

  1. What kind of data can you capture?
  2. How do we do that?
  3. Why is what I catch different from what the browser sees?
  4. How to solve JavaScript rendering problems?
  5. How can data be saved?

I suspect the questions above are more or less confusing to you, or not fully understood. This section takes you through the basic operations of the Requests and BeautifulSoup libraries.

To accommodate the majority of readers with little or no background, I will focus on the basic operations of the Requests and BeautifulSoup libraries. What? Skip the points above? Don't worry, I will answer some of those questions briefly, and share the rest with you in a later Chat. Stay tuned!

What kind of data can you capture?

  • Web page text: such as HTML documents and JSON text.
  • Pictures: you get a binary file, which you save in an image format.
  • Video: also a binary file, which can be saved as a video file.
  • Other: anything that can be requested can be obtained.

How do we do that?

  • Process directly
  • JSON parsing
  • Regular expression
  • BeautifulSoup
  • PyQuery
  • XPath

Why is what I catch different from what the browser sees?

Because the page uses dynamic loading and JavaScript rendering, the raw HTML you fetch may differ from what the browser shows.

How to solve JavaScript rendering problems?

  • Analyzing Ajax requests
  • Selenium/WebDriver
  • Splash
  • PyV8, Ghost.py

How can data be saved?

  • Text: plain text, JSON, XML, etc.
  • Relational databases, such as MySQL, Oracle, and SQL Server: stored as structured tables
  • Non-relational databases, such as MongoDB and Redis: stored in key-value form
  • Binary files: pictures, videos, audio and so on can be saved directly in their specific formats

OK, with those questions out of the way, let's take a look at Requests and BeautifulSoup, the foundation of Python's networking modules.

6.1 Requests

Introduction to Requests

Official documents:

docs.python-requests.org/en/master/

Since we are writing a crawler, let's review the installation; type this on the command line:

# Windows users
pip install requests

# Mac user input
pip3 install requests

A simple example

  • status_code: the status code
  • encoding: the encoding
  • cookies: the cookies
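The example was shown as a screenshot in the original; a minimal sketch of it might look like this (the URL is just an example):

import requests

req = requests.get('https://www.baidu.com')

print(req.status_code)  # status code, e.g. 200
print(req.encoding)     # encoding guessed from the response headers
print(req.cookies)      # cookies returned by the server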

Output result:

Status code

Cookies

What are Cookies and what are they used for

What is a cookie? Simply put, it is a piece of data that records your login information so that you can access your account area directly. Rather than just talking about it, let's look at what cookies are used for, namely the following points:

  1. Session state management (such as user login status, shopping cart, game score, or other information that needs to be logged)
  2. Personalization (such as user-defined Settings, themes, etc.)
  3. Browser behavior tracking (e.g. tracking and analyzing user behavior, etc.)

A hands-on tutorial on simulated login will follow in a later Chat!

The Requests library provides all the basic HTTP request types:

We’ll focus on basic GET requests.

A GET request

Params parameters are available:
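The code was shown as a screenshot; a minimal sketch using the params argument (httpbin.org is a public test service used here as an example) might look like this:

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
req = requests.get('http://httpbin.org/get', params=payload)

print(req.url)   # the parameters are appended to the URL as a query string
print(req.text)  # httpbin echoes the request back as JSON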

Output result:

A POST request

Add a parameter to POST using the data parameter:
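Again a sketch of what the screenshot showed, posting form data to httpbin:

import requests

data = {'name': 'aiyc', 'age': 18}
req = requests.post('http://httpbin.org/post', data=data)

print(req.text)  # the submitted fields appear under "form" in the response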

Uploading files:
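And a sketch of a file upload with the files argument (report.txt is just a placeholder file name):

import requests

files = {'file': open('report.txt', 'rb')}
req = requests.post('http://httpbin.org/post', files=files)

print(req.text)  # the uploaded content appears under "files" in the response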

Get/send cookies

To obtain cookies from www.baidu.com:

import requests

url = 'https://www.baidu.com'

req = requests.get(url)
coo = req.cookies
print(coo)

for key in coo.keys():
	print(key)
	print(coo[key])

# output
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ
27315

Send the cookies:

url = 'http://httpbin.org/cookies'
cookies = dict(my_cookie= 'test')
req=requests.get(url,cookies=cookies)
print(req.text)
# Output
{"cookies": {"my_cookie": "test"}}

Timeout configuration: Use the timeout variable to configure the maximum request time.

import requests
requests.get('http://www.baidu.com', timeout=0.001)

6.2 Using Requests for web crawling

Why Requests?

Requests allows you to send all-natural, plant-fed HTTP/1.1 requests without manual labor. You don't need to manually add query strings to URLs, nor do you need to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, powered by the urllib3 embedded in Requests.

Why study Requests?

For starters, the main reasons are:

  • When learning, we often go online to look for relevant resources, and I can guarantee that, even among veterans, Requests has more than enough users. For a beginner, the bugs you run into have already been hit by your predecessors, so there are more ready-made solutions to your problems.
  • Requests has a wealth of learning resources on the Internet. A search on Baidu for "Requests crawler" yields tens of millions of results, which means the technology around Requests is mature. Especially for beginners, rich learning material reduces the number of pits you fall into while learning;
  • Requests has official Chinese documentation. This is the best resource for newcomers, especially those whose English is not strong. The documentation on the official website provides detailed and very accurate function definitions and instructions. If a problem comes up during development and Baidu, Google, Stack Overflow… have all been tried without solving it, looking through the official documentation is the safest and fastest way.
A first taste of Requests

The installation was covered earlier, but it is repeated here.

The code is as follows:

pip install requests

Try this code:

# The following code is easy to understand.
import requests
# The first line imports the Requests library; this is the import style we use most, along with from... import...

url = "https://www.qiushibaike.com/"
# The second line defines the URL we want to crawl

req = requests.get(url)
# In line 3, we call the get() method in Requests directly, which accesses the web page via GET

web_info = req.text
# When we make a GET request, Requests makes an educated guess about the encoding based on the HTTP headers, so when we access req.text, Requests decodes it using that inferred text encoding.

print(web_info)
# Print it.

Run:

Custom request headers

What is a request header? It is the set of HTTP request headers. When an HTTP client (such as a browser) sends a request to the server, it must specify the request type (usually GET or POST). The client can also choose to send additional headers if necessary. As mentioned above, we usually add headers when writing a crawler; think of it as dressing your crawler up rather than sending it to someone else's website "naked".

The figure above is a typical request header. In a Request, we can easily construct our own Request headers. The code is as follows:

# We usually copy the request headers directly from the browser
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

# It's easy to use
import requests
url = "https://www.qiushibaike.com/"
# Just use it like this
req = requests.get(url, headers = header)
web_info = req.text
print(web_info)

6.3 Beautiful Soup

Introduction to Beautiful Soup

www.crummy.com/software/Be…

A sharp tool for crawlers and an excellent parsing tool.

As mentioned before, let’s review the code:

# Installation: enter this on the command line
pip install lxml
pip install beautifulsoup4

# Mac users enter
pip3 install lxml
pip3 install beautifulsoup4

(Photo credit network)

Note how to import the module name:

What we need is the BeautifulSoup class from the bs4 package.

An example: prettify() formats output
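The example was shown as a screenshot; a minimal sketch of prettify() on a small piece of HTML looks like this:

from bs4 import BeautifulSoup

html = "<html><head><title>AI Yue Chuang</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, 'lxml')

# prettify() re-indents the parsed document, one tag per line
print(soup.prettify())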

A quick start with BeautifulSoup

Without further ado, let’s use an example to explain in detail. The code is as follows:

from bs4 import BeautifulSoup
import requests
url = "https://www.qiushibaike.com/"
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
req = requests.get(url, headers = header)
soup = BeautifulSoup(req.text, 'lxml')
title = soup.title
print(title)

soup = BeautifulSoup(req.text, 'lxml')

This uses BeautifulSoup to parse the req.text page content that we crawled with Requests, using the lxml parser.

This is the most common line of code we use when working with the BeautifulSoup framework. If you don't fully understand how it works yet, that's fine; we're just getting started.

With this line of code, we get a BeautifulSoup object. All of our subsequent web page fetching will be done by manipulating this object. BeautifulSoup parses complex HTML code into a tree structure. Each node is an operable Python object, of which there are four common types.

Four categories of objects:

  • Tag
  • NavigableString
  • BeautifulSoup
  • Comment

Next, we will introduce them one by one.

1. Tag

A Tag is a Tag in HTML.

Note: it returns the first matching tag (even if there are multiple matching tags in the HTML).

This tag is also the basis of the web page I wrote about earlier!

Such as:

title = soup.title

The code above gets the title of the site. The result is as follows:

Bingo! We can get a tag's information directly via soup.tag for any tag in the HTML.

Let’s take a look at a particular Tag in an HTML web page.

<meta name="keywords" content="Joke, joke, joke, joke, joke, joke, joke." />
<meta name="description" content="Qiushi Encyclopedia is an original qiushi joke sharing community, qiushi hundred users to share funny jokes, funny pictures, are qiushi friends the most precious happy experience, laugh qiushi jokes only in qiushi encyclopedia!"/>

In general, description and keywords are among the key pieces of information on a web page. If you just want the general content of a page, you can read these two tags directly. We can run the following code:

from bs4 import BeautifulSoup
import requests
url = "https://www.qiushibaike.com/"
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
req = requests.get(url, headers = header)
soup = BeautifulSoup(req.text, 'lxml')
title = soup.title
description = soup.find(attrs={"name": "description"})
keywords = soup.find(attrs={"name": "keywords"})
print(title)
print(description)
print(keywords)

Running results:

You'll notice that everything retrieved above still carries its tags.

2. NavigableString

I hope you type the code yourself to get a feel for it:

  • .attrs: gets the attributes of the tag as a dictionary
  • The get() method: gets the value of a single attribute of the tag

You can modify or delete these attributes and contents by modifying the dictionary.

print(soup.a.attrs)

print(soup.a['title'])

print(soup.a.get('title'))

soup.a['title'] = "a new title"

NavigableString retrieves the contents of a tag:

Now that we can get a tag, how do we get the content inside it? Very simple: careful readers will have noticed that the .string attribute used above does exactly that.
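A minimal sketch, continuing with the soup object from above:

tag = soup.title
print(tag.string)        # the text inside <title>, a NavigableString
print(type(tag.string))  # <class 'bs4.element.NavigableString'>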

3. Comment

The Comment object is a special type of NavigableString. If we don't handle Comment objects when they appear in an HTML document, they may cause problems later. An HTML comment adds a piece of content that is not rendered on the page.

Let’s look at this HTML:

Do the following:

Let’s look at the output:

We’ll see that there’s a comment.
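Since the screenshots are not reproduced here, a self-contained sketch of the same idea (the HTML string is hypothetical):

from bs4 import BeautifulSoup

# Hypothetical HTML with a comment inside the <b> tag
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'lxml')

comment = soup.b.string
print(comment)        # Hey, buddy. Want to buy a used parser?
print(type(comment))  # <class 'bs4.element.Comment'>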

4. BeautifulSoup

The BeautifulSoup object represents the entire content of a document. Most of the time you can treat it as a Tag object: it supports most of the methods for traversing and searching the document tree. The code is as follows:

print(type(soup))
print(soup.name)
print(soup.attrs)

Document tree – Direct child nodes

Note the difference between .contents and .children.

Please refer to the mind map above.

Document tree – All descendant nodes

.descendants lists all descendants of a tag, which can be handled by the for loop:

Document tree – Node content

The output of soup.a.string is the same as that of soup.p.string.

Note: what does .string return if the tag contains more than one child node? It returns None.

Attention! Spaces and newlines count as one node!

What if we want to get multiple contents under a tag?

.strings or .stripped_strings:

The difference between .strings and .stripped_strings is that .stripped_strings removes extra whitespace.
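The screenshots of these operations are not reproduced, so here is a small sketch covering .contents, .children, .descendants, .string and .stripped_strings (the HTML is hypothetical):

from bs4 import BeautifulSoup

html = "<html><head><title>AI Yue Chuang</title></head><body><p>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(html, 'lxml')

print(soup.body.contents)             # list of direct children
print(list(soup.body.children))       # the same children, from a generator
for child in soup.body.descendants:   # all descendants, depth-first
    print(child)

print(soup.title.string)              # 'AI Yue Chuang'
print(soup.p.string)                  # None: <p> has more than one child node
print(list(soup.p.stripped_strings))  # ['Hello', 'world']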

Search the document tree — find_all

The code is as follows:

print(soup.find_all('a'))

find_all() accepts:

  • Tag names, such as a, p, h1, and so on
  • Lists, such as ['a', 'p']
  • True, which finds all child nodes
  • Regular expressions

Keyword parameters:

find_all(attribute name in the tag = attribute value)

If you are filtering by class, make sure you write class_, because class is a Python keyword. A small sketch of these variants follows.
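A sketch of the find_all() variants described above (continuing with any soup object; the attribute values are just examples):

import re

print(soup.find_all('a'))                  # by tag name
print(soup.find_all(['a', 'p']))           # by a list of tag names
print(soup.find_all(re.compile('^h')))     # by regular expression: html, head, h1, ...
print(soup.find_all(id='link'))            # by keyword argument on an attribute
print(soup.find_all('p', class_='story'))  # class_ because class is a Python keyword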

6.4 CSS selectors

The code is as follows:

soup.select('title')
soup.find_all('title')

soup.select('a')
soup.find_all('a')

soup.select('.footer')
soup.find_all(True, class_ = 'footer')

soup.select('p #link')
soup.find_all('p', id = 'link')

soup.select('head > title')
soup.head.find_all('title')

soup.select('a[href="www.baidu.com"]')
soup.find_all('a', href = 'www.baidu.com')

# soup.select() filters elements and returns a list

Syntax rules:

  • Tag names are written with no prefix
  • Class names are prefixed with a dot (.)
  • id names are prefixed with #

6.5 BeautifulSoup in practice

So much theory can feel "too boring". I personally prefer hands-on practice, but the theory is also very important. Now let's learn BeautifulSoup through a real example.

The website we crawl today is Qiushi Baike: www.qiushibaike.com.

We’re going to crawl the text jokes on this site.

Here are the jokes that we’re going to crawl.

1. Analyze the URL pattern

Since we want to grab several pages, first analyze the URL pattern of the text section of this site:

  • The first page: www.qiushibaike.com/text/
  • The second page: www.qiushibaike.com/text/page/2…
  • The third page: www.qiushibaike.com/text/page/3…

It's pretty straightforward: across pages only the number after page changes, so we can handle this with a for loop.

Let's give it a quick try.

As a basic check, run the following code:

import requests
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
url = 'https://www.qiushibaike.com'

for page in range(1, 2):
    req = requests.get(url + f'/text/page/{page}/', headers = header)
    soup = BeautifulSoup(req.text, 'lxml')
    title = soup.title.string
    # print(soup)
    print(title)

Sublime Text 3 runs the code with the Ctrl+B shortcut:

2. Use the select() function to locate the specified information

As shown in the image above, we want to crawl the joke content highlighted in the upper right corner. How do we do this?

The code is as follows:

laugh_text = soup.select('#qiushi_tag_122135715 > a > div')

Passing a string argument to the .select() method of a Tag or BeautifulSoup object finds tags using CSS selector syntax.

Using the.select() method helps us locate the specified Tag. So, how do we determine this designated location? Let’s take a look at the GIF!

To be specific, we first place the mouse over the text to be located, and then perform the following operations:

  1. Right-click and select "Inspect"
  2. In the panel that pops up, select the corresponding text
  3. Using the element picker, locate the data you want to work with (see the GIF)
  4. Next, analyze the structure and write the CSS selector

We found that it is in the span tag inside the div with class="content". So let's write the CSS selector like this and try it out:

.content span

Run the following code:

import requests
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
url = 'https://www.qiushibaike.com'

for page in range(1, 2):
    req = requests.get(url + f'/text/page/{page}/', headers = header)
    soup = BeautifulSoup(req.text, 'lxml')
    laugh_text = soup.select('.content span')
    print(laugh_text)
    # title = soup.title.string
    # print(soup)
    # print(title)

Let’s run it:

If you want only the text contained in a Tag, you can use the get_text() method, which retrieves all the text contained in a Tag, including the contents of descendant tags.

Let's try this by adding get_text():

import requests
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
url = 'https://www.qiushibaike.com'

for page in range(1, 5):
    req = requests.get(url + f'/text/page/{page}/', headers = header)
    soup = BeautifulSoup(req.text, 'lxml')
    laugh_text = soup.select('.content span')
    # Because we get a list, we need to iterate over it with a for loop
    for laugh in laugh_text:
        print(laugh.get_text())
    # print(laugh_text)
    # title = soup.title.string
    # print(soup)
    # print(title)

Of course, we can also use text, strings, or stripped_strings directly.

3. Use.get() to get the specified property

Suppose you have the following code in your HTML:

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

How do we access the text inside? That's right, just use the .get_text() function.

So, what if we just want to get its URL? Or what about its corresponding class name? This requires us to use the.get() function.

First select() the tag, then call get():

# .get("class")
# .get("href")
Copy the code

Practical summary

  1. How do I get the corresponding location of the page information in the HTML, how do I get the corresponding selector using Chrome
  2. Use of the.get_text() function
  3. Of course, we can also use text, strings, stripped_strings
  4. Use of the.get() function
  5. Be sure to browse the official documentation for BeautifulSoup

Seven, crawler library Scrapy basics

(Photo from Internet)

7.1 Scrapy principle and installation

Scrapy principle introduction

(Picture from Internet)

  1. Scrapy Engine: the heart of Scrapy
  2. Scheduler: a leader that assigns tasks and lets every worker play its role
  3. Downloader: the downloader
  4. Spiders: the crawlers (spiders) themselves
  5. Item Pipeline: the project pipeline, the cleaning area

Specific functions: [2]

  • Engine: processes the data flow of the whole system and triggers events; it is the core of the entire framework.
  • Item: defines the data structure of the crawl result; the crawled data is assigned to Item objects.
  • Scheduler: receives requests from the engine, queues them, and serves a request back to the engine when it asks for one again. (You can think of the scheduler as assigning tasks: who does what, in what order, and then feeding the results back to the engine.)
  • Downloader: downloads web content and returns it to the spider. (You can think of it as the downloader handing the downloaded data back to the spider so it can check whether the data is correct, whether any content it wanted is missing, and so on.)
  • Spiders: define the crawling logic and the web page parsing rules; a Spider is responsible for parsing responses and generating extraction results and new requests. (Put another way, the Spider says which site you are requesting and which parts of the site you need to crawl, for example job title, salary, and location when crawling an internship site.)
  • Item Pipeline: processes the items extracted from web pages by the Spiders. Its main tasks are cleaning, validating, and storing data.
  • Downloader Middlewares: a hook framework between the engine and the downloader that handles the requests and responses passing between them.
  • Spider Middlewares: a hook framework between the engine and the Spider that handles the responses going to the Spider as well as its output and new requests. (You can think of a hook as something that ties the engine to the spider.)

Overall process:

Spiders >>> Engine >>> Scheduler >>> Engine >>> Downloader >>> Spider >>> (if a new URL is returned) Scheduler >>> otherwise Item Pipeline

Crawl process:

For each URL

Scheduler >>> Downloader >>> Spider >>> Scheduler >>> Downloader >>> Spider >>>

  1. If a new URL is returned, the Scheduler is returned
  2. If it is data that needs to be saved, it will be put into the item pipeline

Extension: Data Flow [3]

The data flow in Scrapy is controlled by the Engine as follows:

  1. First, open a web site, find the Spider that handles the site, and request the first URL to crawl from that Spider.
  2. The Engine retrieves the first URL to crawl from a Spider and uses Scheduler to schedule it as a Request.
  3. The Scheduler returns the next URL to crawl to the Engine, which forwards it to the Downloader via the Downloader Middlewares;
  4. Once the page is downloaded, Downloader generates a Response for the page and sends it to Engine via Downloader Middlewares;
  5. The Engine receives the Response from the downloader and sends it to the Spider through the Spider Middlewares;
  6. The Spider handles the Response and returns the extracted Item and the new Request to the Engine.
  7. The Engine feeds the Item returned by the Spider to the Item Pipeline and the new Request to the Scheduler;
  8. Repeat steps 2 through 7 until there are no more requests in the Scheduler; then the Engine closes the site and the crawl ends.

Through the collaboration of multiple components, different components to perform different tasks, and support for asynchronous processing, Scrapy maximizes the use of network bandwidth, greatly improving the efficiency of data crawling and processing.

Installing Scrapy

Method 1

Windows:

pip install scrapy

Mac:

xcode-select --install  
pip3 install scrapy

Method 2: step-by-step installation, installing the dependencies first

1. Install lxml

pip install lxml 

Install pyOpenSSL (wheel):

Pypi.org/project/pyO…

pip install <the wheel file you just downloaded>

2. Install Twisted

www.lfd.uci.edu/~gohlke/pyt…

The versions on this page get updated; pay attention and select the version appropriate for your system!

pip install <the wheel file you just downloaded>

3. Install PyWin32

pip install pywin32

With dependencies installed, you can install Scrapy.

pip install scrapy

7.2 Introduction to Scrapy

Create a project

Go to the directory where you want to store the code (on the command line) and type the following code:

scrapy startproject tutorial

The directory structure

With the preparation done, it’s time for us to get on with our subject.

First Scrapy crawler

We're going to scrape: quotes.toscrape.com/

Go to the website and have a look:

First of all, we can of course crawl it the way we did before; here is the code:

# -*- coding: utf-8 -*-
# @Author: Huang Heirloom
# @Corporation: AI Yuechuang
# @Version: 1.0
# @Function: Function
# @DateTime: 2019-07-26 12:24:51
# ============================
import requests
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

url = 'http://quotes.toscrape.com/'
# Take the first 5 pages:
for index, page in enumerate(range(1, 6)):
    req = requests.get(url + f'page/{page}/', headers = header)
    soup = BeautifulSoup(req.text, 'lxml')
    # print(soup)
    title = soup.title.string
    print(f'This is page {index + 1}, titled: {title}')  # The title is the same on every page, but writing it this way lets you tell the pages apart later
    # Article content:
    texts = soup.select('.text')
    for index, text in enumerate(texts):
        print(f'This is item {index + 1}.\n{text.string}')
    print('\n\n\n')  # Add blank lines between pages; you can also use another method

Output (partial):

This part is left as a small extension: try connecting to the Baidu Translate API to translate the scraped data directly!

So let’s try it out with Scrapy.

1. Create the spiders

A Spider is a self-defined class that Scrapy uses to crawl content from web pages and parse the results. The class must inherit from Scrapy's Spider class scrapy.Spider and define the Spider's name, its starting requests, and how to handle the crawled results.

You can also create a Spider using the command line. For example, to generate the Quote Spider, run the following command:

cd tutorial
scrapy genspider quotes quotes.toscrape.com

Parse each step:

  1. cd tutorial goes into the tutorial folder you just created, and then the scrapy genspider command is executed.
  2. The first parameter is the name of the Spider and the second parameter is the domain name of the web site.
  3. After it runs, the spiders folder contains a new quotes.py, which is the Spider you just created:

Click inside to see:

Let’s parse each part of the generated content:
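The generated file was shown as a screenshot; the default template produced by scrapy genspider looks roughly like this:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass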

Shortcuts to start_requests:

Page: Gets the page number

Where there is data capture there is naturally data storage, so let's talk about data storage along the way.

File storage in Python

When a Python file is opened with "wb" mode, writing a plain string raises an error, because this mode opens the file in binary format for writing only, overwriting the file if it already exists and creating a new one if it doesn't.

Therefore, what is written must be bytes, for example:

f.write("hello".encode('ascii'))

ASCII cannot be used for characters outside its 128-character range, as in:

fh.write("Chinese characters".encode('utf-8'))

Let’s define a function called start_requests as follows:

  • Yield in the figure: first, if you don't yet have even a rudimentary idea of what yield is, you can think of it as "return". That is the intuitive view: it returns a value, after which an ordinary function stops running. In reality, yield is part of a generator (functions with yield are really iterators). If that doesn't make sense yet, just think of yield as return.
  • Callback: indicates a callback.

Note: different Spiders cannot have the same name, and these special method names (start_requests, parse) are fixed by Scrapy and cannot be renamed arbitrarily!

Run the code and see for yourself.

Note: be sure to run it from the project root directory! In our case that is tutorial/, otherwise an error will be reported.

The code is as follows:

scrapy crawl quotes
Copy the code

The lower half of the output:

Analysis of crawler results

Looking at our root directory, we find two new HTML files:

So far we have only saved the body of the entire web page. Next, we can use the following command:

scrapy shell "http://quotes.toscrape.com/page/2/" Enter scrapy interaction modeCopy the code

Note:

  • Run it from the root directory
  • The URL must be enclosed in double quotation marks, not single quotation marks, otherwise an error will be reported:

Correct input results:

After entering interactive mode, extract data:

response.css('title') returns a CSS selector result (a list-like SelectorList).

Extract the HTML element with: response.css('title').extract()

.extract() returns a list; if you only want to process the first result:

Must know: in addition to CSS, Scrapy selectors also support XPath.
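For example, the same title can be selected either way (a quick sketch, assuming the quotes page is loaded in scrapy shell):

# CSS and XPath, side by side
response.css('title::text').extract_first()
response.xpath('//title/text()').extract_first()
# Both return the page title as a plain string.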

Extract the data

To select the div with class="quote", the div tag name can be omitted. The code is as follows:

response.css("div.quote")
Copy the code

Let’s extract Quote and content together:

response.css("div.quote").extract()
# extract() Extract the actual content
Copy the code

Note: the double colon ::text extracts the text of the element.

Without ::text:

With ::text:

From the comparison above, adding ::text gives plain text content, while leaving it out gives the content together with its tags. Work through the following examples carefully and summarize the difference for yourself.
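A sketch of the difference, assuming the quotes page is loaded in scrapy shell (output abbreviated):

response.css('.text').extract_first()
# -> '<span class="text" ...>...</span>'   (the content together with its tag)

response.css('.text::text').extract_first()
# -> '...'                                  (plain text only)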

Extract Quote and save it to TXT file
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    # start_urls = [
    #     'http://quotes.toscrape.com/page/1/',
    #     'http://quotes.toscrape.com/page/2/'
    # ]

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]  # which page we have crawled
        # file_name = f'quotes-{page}.html'
        file_name = f'quotes-{page}.txt'
        with open(file_name, 'wb') as f:
            quotes = response.css(".quote")  # quotes is a list, so we iterate over it
            for index, quote in enumerate(quotes):
                text = quote.css("span.text::text").extract_first()  # the quote text
                author = quote.css('small.author::text').extract_first()
                tags = quote.css(".tags .tag::text").extract()
                # or: f.write(f"No.{index + 1}".encode())
                f.write("No.{}".format(index + 1).encode())
                f.write("\r\n".encode())
                f.write(text.encode())
                f.write("\r\n".encode())
                f.write("By {}".format(author).encode())
                f.write("\r\n".encode())
                tags_str = ''
                for tag in tags:
                    tags_str += tag + ","
                f.write(("Tags: " + tags_str).encode())
                f.write("\r\n".encode())
                f.write(("-" * 20).encode())
                f.write("\r\n".encode())
        self.log(f"Saved file {file_name}")
Copy the code
Summary
  1. response.css("div.quote") returns a list of selectors; to work with the actual data you usually extract a single element, for example index 0 (other indices work too).
  2. The double colon ::text extracts the text of the element.
  3. .extract() pulls the actual content out of the selector result.
  4. .extract_first() returns the first element of the list.

7.3 Scrapy’s Interactive mode — Extras

Learn more about CSS selection in Scrapy.

Enter Scrapy’s interactive mode

The code is as follows:

scrapy shell "http://quotes.toscrape.com/"  # the URL is the page that scrapy shell will fetch
Copy the code

Let’s do a quick analysis.

1. response is the web-page content fetched by Scrapy. The code is as follows:

response.css('.text')
Copy the code

Data obtained:

Analysis:

  1. The result is a list-like object
  2. It contains the text elements we wanted to crawl
  3. And you'll notice that every match on the page ends up in that one list

So how do we pull the content out of it? Add .extract().

response.css(".text").extract()
Copy the code

Example output:

If we just want the first element of this list, there are two methods.

Method 1: extract_first()

response.css(".text").extract_first()
Copy the code

Method 2: index it like an ordinary list

# Review of lists
# Create a list
list1 = ['LiLei', 'Make', 'AIYC']
print(list1[0])  # get element 0 of the list
print(list1[1])  # get element 1 of the list
print(list1[2])  # get element 2 of the list

>>> LiLei
>>> Make
>>> AIYC
Copy the code

OK, review done; back to the point. Method 2:

response.css(".text").extract()[0]  Get list 0 as element
Copy the code

The operation process of the two methods:

But at this point you'll find that, whether you use method 1 or method 2 (response.css(".text").extract_first() or response.css(".text").extract()[0]), the result still comes wrapped in its HTML tag.

So how do we solve this problem?

Use ::text:

Now the list contains plain text: the data type is still a list, and each element is a string.

Extract the data

Everything above has already been covered; now let's run through it as a whole:

  1. Extract Quote and content
  2. The author information
  3. Extract the corresponding tag

7.4 Extract Quote and save it in TXT file

1. Let’s extract Quote and content together

Run example:

So let’s just extract the text.

Original Webpage analysis:

Note: div can be omitted.

Run example:

2. Let’s extract author information together

The same steps, first analyze the page source:

Run example:

3. Extract the tag

Also analyze the web page first:

The running process (of course you can shorten the CSS selector; I spell it out here so that beginners can follow more easily):

Of course, you can also do this:
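In code, the three extractions walked through above look roughly like this (run in scrapy shell; the selectors are the same ones used in the spider earlier in this chapter):

response.css('div.quote span.text::text').extract()     # 1. the quote text
response.css('div.quote small.author::text').extract()  # 2. the author information
response.css('div.quote .tags .tag::text').extract()    # 3. the corresponding tags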

Next, we officially start the NetEase News project!

Scrapy in practice: crawling NetEase news

8.1 Creating a Project

The code is as follows:

# Create a new project
# 1. Launch the console in the target directory
# 2. Then type the following command on the command line:
#    scrapy startproject <project name>
# Here the project name is: news

# Final input:
scrapy startproject news
Copy the code

As follows:

https://mp.weixin.qq.com/s/yBkXGT6dFgg46WeaZ18rjA
Copy the code

8.2 Create the Spider

Note: the name of the Spider created here cannot be the same as the name of the project, otherwise an error will be reported!!

The code is as follows:

# Run the following command on the command line in the root directory of the project.
# scrapy genspider <spider name> <domain>
# Here we create a spider named news163
# Domain name: news.163.com
# Note:
# 1. Do not prefix the domain with a protocol (https://) or add www
# 2. Scrapy adds the protocol automatically when it generates the spider

# Final input:
scrapy genspider news163 news.163.com

# Extension: generate a CrawlSpider template directly
scrapy genspider -t crawl news163 news.163.com
Copy the code

As follows:

Actual folder file picture:

FIG. 1

Figure 2

8.3 Go to the created Spider >>> news163.py

As shown in figure:

Parse each section:

  • import scrapy: imports the Scrapy library
  • class News163Spider(scrapy.Spider): our spider class, which inherits from the parent class scrapy.Spider
  • name: the crawler's name, news163
  • allowed_domains = ['news.163.com']: the domain the crawler is allowed to crawl; initial or subsequent requests outside this domain are filtered out
  • start_urls = ['http://news.163.com/']: the URLs the crawl starts from
  • parse: by default the responses for start_urls are handled by the parse method first; when we define several callbacks ourselves we also have to keep them apart (e.g. parse1, parse2)

Here I think it’s better to use pictures:

8.4 Extension: XPath

XPath definition: [4]

XPath is a language for finding information in XML documents. It can be used to traverse elements and attributes in an XML document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.

Therefore, an understanding of XPath is the foundation of many advanced XML applications.

What is XPath?

  • XPath uses path expressions to navigate through XML documents
  • XPath contains a library of standard functions
  • XPath is a major element in XSLT
  • XPath is a W3C standard
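A few illustrative path expressions as they would be used in Scrapy (assuming a response is loaded in scrapy shell; the selectors here are only examples):

response.xpath('//a/@href').extract()               # the href attribute of every <a> link
response.xpath('//div[@class="quote"]').extract()   # every div whose class attribute is "quote"
response.xpath('//span/text()').extract_first()     # the text of the first matching <span>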

8.5 items.py

Keep this in mind: these are the attributes we want to capture for every page we crawl:
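The items.py screenshot isn't reproduced here; based on the fields the spider fills in later in this chapter, it looks roughly like this:

import scrapy


class NewsItem(scrapy.Item):
    # One Field per attribute we want to capture for each news page
    news_thread = scrapy.Field()
    news_title = scrapy.Field()
    news_time = scrapy.Field()
    news_source = scrapy.Field()
    source_url = scrapy.Field()
    news_url = scrapy.Field()
    news_body = scrapy.Field()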

8.6 spiders.py

Import libraries:
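The screenshot isn't shown here; the imports being discussed (they reappear in the full spider code below) are:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from news.items import NewsItem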

Breakdown:

  • import scrapy: imports the Scrapy framework
  • from scrapy.linkextractors import LinkExtractor: imports the link extractor
  • from scrapy.spiders import CrawlSpider, Rule: imports the CrawlSpider class and Rule; a Rule describes a crawling rule
  • from news.items import NewsItem: imports the items.py we just wrote

Extension:

  • CrawlSpider: a derived class of Spider
  • LinkExtractor: an object that extracts the links to follow from responses (scrapy.http.Response)

Note:

  1. Our spider class (here News163Spider) must inherit from CrawlSpider
  2. Do not use parse as the callback, and delete the automatically generated parse method

8.7 Writing URL Crawling Rules

  • Rule: a crawling rule.
  • LinkExtractor: the link extraction; this is what pulls the links out of the page, and of course we have to say which links, which brings us to point 3.
  • allow: the URL pattern that is allowed to be crawled.
  • callback: once a link matching the rule (allow, via the LinkExtractor) is crawled, the response for that link is passed to this function.
  • A function used this way, such as parse_news, is what we call a callback; see the sketch after this list.
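Putting those pieces together, a rules definition looks roughly like this (the regex and callback name follow the prose above; the complete version appears in the full spider code below):

rules = (
    # Follow every link whose path matches the allow pattern and hand its
    # response to the callback (called parse_news here, parse_item in the full code below).
    Rule(LinkExtractor(allow=r'/19/0801/\w+/.*?\.html'), callback='parse_news', follow=True),
)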

8.8 Regular expressions

A quick dive into regular expressions, applied to analyzing the target URLs.

Here we use netease News:

https://news.163.com/19/0801/15/ELGM0GIV000187VE.html
https://news.163.com/19/0801/20/ELH57GI0000189FH.html
Copy the code

Let’s remove the previous domain name:

/19/0801/15/ELGM0GIV000187VE.html
/19/0801/20/ELH57GI0000189FH.html
Copy the code

Then we write the regular expression (characters that have a special meaning in regular expressions are escaped; the rest is written as-is):

/19/0801/\w+/.*?\.html
# /19/0801/ is the date >>> August 01, 2019
Copy the code

From analyzing the URLs we know that the part after the date is what varies: \w matches letters (upper and lower case), digits, and underscores, and since there are two digits after the date we add +. The file name before .html is then matched lazily with .*? .
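To sanity-check the pattern, you can try it with Python's re module (a quick sketch using the two example paths above):

import re

pattern = re.compile(r'/19/0801/\w+/.*?\.html')
paths = [
    '/19/0801/15/ELGM0GIV000187VE.html',
    '/19/0801/20/ELH57GI0000189FH.html',
]
for path in paths:
    print(bool(pattern.search(path)))  # prints True for both paths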

8.9 parse_news

parse_item below is our callback (the parse_news role we described above). To keep it from bloating, we split the work across several helper methods of the class. The code is as follows:

# In the spider file; my spider is news163.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from news.items import NewsItem

class News163Spider(CrawlSpider):
    name = 'news163'
    allowed_domains = ['news.163.com']
    start_urls = ['http://news.163.com/']

    rules = (
        Rule(LinkExtractor(allow=r'/19/0801/\w+/.*?\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = NewsItem()      # instantiate the item
        item['news_thread'] = response.url.strip().split('/')[-1][:-5]
        self.get_title(response, item)
        self.get_time(response, item)
        self.get_source(response, item)
        self.get_url(response, item)
        self.get_source_url(response, item)
        self.get_body(response, item)
        return item
Copy the code
# Check whether the data we obtained is empty!
if title:
    pass

# Roughly equivalent to (note: `if title:` also skips an empty result)
if title is not None:
    pass

# Removing whitespace
str2 = " Runoob "   # remove leading and trailing spaces
print(str2.strip())
>>>Runoob

# We will do the same kind of cleanup below
Copy the code

The Python strip() method removes the specified characters (whitespace and newlines by default) or character sequence from the beginning and end of a string.

Note: this method only removes characters at the beginning or end of the string, not in the middle.

To help you understand, examples of code:

str = "00000003210Runoob01230000000"
print(str.strip('0'))  # strip leading and trailing '0' characters

str2 = " Runoob "   # Remove leading and trailing Spaces
print(str2.strip())

str = "123abcrunoob321"
print(str.strip('12'))  # strip the characters '1' and '2' from both ends

# output result
>>>3210Runoob0123
>>>Runoob
>>>3abcrunoob3
Copy the code

Ok, with the above foundation, we can formally write each part of the function!

To get the title, let's first show the steps of the web-page analysis with a picture:

Then we can write the following code:

    def get_title(self, response, item):
        title = response.css('.post_content_main h1').extract()
        print('*' * 10)
        if title:
            item['news_title'] = title[0]
Copy the code

Then we can continue and write the code that gets the time; the time needs some special handling:

  1. Analysis of web page
  2. Write the code
  3. Observe the preliminary operation results
  4. Make a change

Analysis of web page

Write corresponding code:

def get_time(self, response, item):
    time = response.css('div.post_time_source').extract()
    if time:
        item['news_time'] = time[0]
Copy the code

Run code:

Looking at the result, it isn't quite right: the time is scraped successfully, but it carries extra parts we don't need, as the following shows:

# '2019-08-02 09:56:00\u3000'
# 1. There is extra whitespace around the date
# 2. The date is followed by: \u3000来源: ("source:")

How to solve these problems?

# Remove the spaces before and after the string
# .strip()

# Remove: \u3000来源: ("source:")
# Two methods. Method 1:
# use slicing: there are 5 extra characters after the date,
# so: [:-5]

# Method 2:
# the .replace() built-in string method:
# .replace('\u3000来源:', '')

# replace() review:
# str.replace(old, new[, max])
#   old -- the substring to be replaced
#   new -- the new string that replaces old
#   max -- optional; replace at most max occurrences
Copy the code

Note: when selecting with response.css(), add ::text inside the selector and call .extract() on the result. The current complete Spider code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from news.items import NewsItem

class News163Spider(CrawlSpider):
    name = 'news163'
    allowed_domains = ['news.163.com']
    start_urls = ['http://news.163.com/']

    rules = (
        Rule(LinkExtractor(allow=r'/19/08\w+/\w+/.*?\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = NewsItem()      # instantiate the item
        item['news_thread'] = response.url.strip().split('/')[-1][:-5]
        self.get_title(response, item)
        # The methods we call here are our own helpers defined below.
        self.get_time(response, item)
        # self.get_source(response, item)
        # self.get_url(response, item)
        # self.get_source_url(response, item)
        # self.get_body(response, item)
        return item

    def get_title(self, response, item):
        title = response.css('.post_content_main h1::text').extract()
        print('*' * 10)
        if title:
            item['news_title'] = title[0]

    def get_time(self, response, item):
        time = response.css('div.post_time_source::text').extract()
        if time:
            item['news_time'] = time[0].strip().replace('\u3000来源:', '')  # drop the trailing "\u3000来源:" ("source:") part

Copy the code

As can be seen from the above, we still have the following data that we need:

self.get_source(response, item) 
self.get_url(response, item) 
self.get_source_url(response, item) 
self.get_body(response, item)
Copy the code

So let’s go ahead and write.

Get a news source

Analysis of the original webpage:

The code is as follows:

# get_source(response, item)

def get_source(self, response, item):
    source = response.css('#ne_article_source::text').extract()
    if source:
        item['news_source'] = 'Source:' + source[0]
Copy the code

Get web address

def get_url(self, response, item):
    url = response.url
    if url:
        item['news_url'] = url
Copy the code

Get the URL of the news's original source

self.get_source_url(response, item)

As can be seen from the page source code diagram, the news url of this source is the href attribute value. So how do we extract that address?

We use ::attr().

def get_source_url(self, response, item):
    source_url = response.css('#ne_article_source::attr(href)').extract()
    if source_url:
        item['source_url'] = source_url[0]
Copy the code

Next comes the most important step, getting the news content of the web page:

The code is as follows:

    def get_body(self, response, item):
        bodys = response.css('.post_text p::text').extract()
        # l = []
        # for body in bodys:
        # body = body.replace('\n','')
        # body = body.replace('\t','')
        # l.append(body)
        # item['news_body'] = str(l)
        if bodys:
            item['news_body'] = bodys
Copy the code

All the data we need now has been crawled.

Now, let's write pipelines.py.

8.10 pipelines.py

First, let's check the character encoding of the web page, which matters for storing the data later; otherwise the output may be garbled.

Right-click source >>> Control + F >>> Enter charset

Then, let’s look at a basic property:

As can be seen from the picture, this page is encoded in UTF-8. Now write the code:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import CsvItemExporter

class NewsPipeline(object):
    def __init__(self):
        # self binds the attribute to the instance, so the other methods can use it directly.
        # Create the output file, opened in binary write mode.
        self.file = open('news_data.csv', 'wb')

        # Pass in self.file and the output encoding.
        self.exporter = CsvItemExporter(self.file, encoding='utf-8')
        # With that set up, self.exporter can start working.
        # Signal the start of the exporting process.
        self.exporter.start_exporting()

        # We opened a file to save the crawled data,
        # so there must also be a matching finish/close when the spider ends.

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Copy the code

8.11 settings.py

We also need to configure the settings.py file: enable our pipeline in the ITEM_PIPELINES setting.

ITEM_PIPELINES = {
    'news.pipelines.NewsPipeline': 300,
}
Copy the code

The following pictures:

Run code:

# Run it in your project folder!
scrapy crawl news163
Copy the code

Extended tip: quickly save the crawled data to a file

# scrapy crawl <spider name> -o <output file name>
scrapy crawl news163 -o data.csv
Copy the code

With the above command, we don't even need to write the pipeline (pipelines.py) or configure settings.py.

Speeding up the crawl:

# Configure the maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# 1. By default Scrapy issues at most 16 concurrent requests
# 2. This number controls how many requests are sent at the same time
# 3. It's best not to raise it: doing so makes it easy to get your IP blocked; after all, you aren't bringing the site any traffic
Copy the code

8.12 Extra: image crawling

For the full write-up, please follow the official account: AI Yuechuang.

Method 1: The code is as follows

import requests
from bs4 import BeautifulSoup

url = 'https://news.163.com/19/0804/16/ELOF3EB4000189FH.html'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'lxml')

image_url = soup.select('div#endText .f_center img')[0].get('src')
with open('image.jpg'.'wb') as f:
    picture = requests.get(image_url)
    f.write(picture.content)
Copy the code

Method 2:

8.13 Homework

Crawl the Full Anime List on www.animeshow.tv/ and store the anime information in any format (TXT, database, JSON, etc.). Required fields: anime name, anime URL, type, air date, status, genre, summary.

9. Summary

This Chat is basically over; thank you very much for reading. Because there is a lot of content and text, here is a summary of the Chat to help you organize your thoughts and get more out of it.

Front end:

  1. Domain name, web development tools and other front-end related knowledge
  2. Parsing of HTML major elements

Crawler foundation:

  1. The principle and basic flow of crawler
  2. Request and Response concepts, including content and common Request methods

Operations for the Requests and Beautifulsoup libraries:

  1. Capture and parse data
  2. FAQ Resolution
  3. Save the crawl data
  4. A deeper introduction to Requests, BeautifulSoup, and Cookies, the foundations of Python's networking modules

Crawler professional library Scrapy foundation:

  1. Principle and installation of Scrapy
  2. Getting started: crawling the Quotes to Scrape website
  3. Going further: Scrapy's interactive mode

In depth: Scrapy in practice, crawling NetEase news

To lighten the load for readers starting from zero, we also covered Python installation and the basics at the beginning of the Chat.

reference

Out of respect for originality, here are some of the blogs and books cited in this Chat:

  • [1] www.w3school.com.cn/xpath/index…
  • [2] “Python3 Web Crawler Development Practice” P468-P470- Cui Qingcai
  • [3] “Python3 Web Crawler Development Practice” P468-P470- Cui Qingcai
  • [4] www.w3school.com.cn/xpath/index…

Thanks again for reading. If you're interested in the front end after learning the basics of HTML, check out our next Chat, where we'll cover Django in more detail. Welcome to follow our official account: AI Yuechuang.