This article was written by Tuque community member Canruo Starry Sky. You are welcome to join the Tuque community and create great free technical tutorials together, helping the programming industry grow.

If you think we have done a good job, please remember to like + follow + comment 🥰🥰🥰 and encourage us to write a better tutorial 💪

Introduction

This is a hands-on tutorial series. Starting from the simplest possible crawler program, it aims to teach you how to fish rather than hand you a fish: we analyze the design ideas in detail and show, step by step, how a crawler is debugged into its final form. Along the way we share all kinds of crawler knowledge and skills, so that everyone can understand crawlers, design crawlers, use crawlers, and ultimately enjoy the conveniences they bring to work and life.

Preliminary knowledge

  1. Basic knowledge of Python programming
    1. Python3 is used as the development environment
    2. Basic familiarity with Python packages
  2. Basic web programming knowledge
  3. Basic understanding of HTTP protocol

This is the simplest computer program

Speaking of crawlers, everyone has heard at least a little about them and tends to find them either mysterious or fascinating. Speaking as a humble coder with four or five years of code under my belt, crawlers are among the simplest and most interesting computer programs there are. Anyone who can surf the Internet has the talent to write a crawler.

Why is the crawler named after a bug? Because a bug's mind is simple, and so is the crawler: it is "single-minded". It hides no deep mathematics and needs no sophisticated algorithms. We just have to tell the computer, in plain language it can understand, what we want it to do.

Basic routines for developing crawlers

In one sentence, every crawler works by imitating how a human operates on the Internet in order to find, download, and store data. Next, I'll use this website as an example to walk through the routine.

Step 1: Open the target URL

Chrome is highly recommended

First open the website www.nanrentu.cc/sgtp/ and you will see the following screen:

We've just opened the page manually in the browser; now we want our program to open it with code. So it's time to analyze how a browser opens a page. Here is a simple flow chart:

In Python, we can use the Requests package to send HTTP requests. To understand what the page looks like from the program's point of view, we save the HTML the program receives to a local file and then open that file in a browser, so we can feel what the program feels and reach the realm of "man-machine unity".

Let’s create a new index.py file and write something like this:

import requests
url = "https://www.nanrentu.cc/sgtp/"
response = requests.get(url)
if response.status_code == 200:
    with open("result.html".'a',encoding="utf-8") as f:
        f.write(response.text)

Open the HTML file we just wrote in a browser and it looks like this:

How is this different from what you see in the browser?

This time I'll bring out a secret weapon, the Chrome debug console (press F12), and use it to run a quick analysis for you.

In fact, the page we see in a browser is not just an HTML page but a composite of CSS, JS, HTML, and various media resources, all rendered together by the browser. The red box marks the various resources loaded during this process.

When our program calls the server, it gets only the HTML page, which is why what the program sees looks so different from what we see. But that doesn't matter: the HTML is the backbone, and once you have the backbone, everything else follows along.

Step 2: Find the target resource

Once the site is open, every fairy can take what she needs. Want to experience Xiao Yaxuan's happiness? Fresh young faces are the target. Craving Eddie Peng's physique? That beefy guy is your type. On top of that, Korean, European, and American styles of men are all on offer.

Humans are advanced creatures whose eyes automatically focus on a target, but a crawler is single-minded: it won't focus on anything by itself, so we have to guide it.

Those of you who have written front-end pages know that CSS styles are bound to their nodes through selectors, so we can likewise use CSS selectors to pick out the elements whose information we want to extract. Chrome has this built in: it can generate a CSS selector for whichever element you choose.

The process is as follows. Next, in step three, we get to enjoy the little brothers ~

Step 3: Parse the page

It's time to introduce PyQuery, a page-parsing tool. It uses the CSS selector we just copied to find elements in an HTML page and easily extract their attributes, which is exactly how we'll pick out this little brother.
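Before wiring it into our crawler, here is a minimal standalone sketch of the idea (assuming PyQuery is already installed; the HTML fragment here is made up purely for illustration):

from pyquery import PyQuery as pq

# A made-up HTML fragment, just to show the idea
html = '<div class="card"><a href="/p/1"><img src="cover.jpg"></a></div>'
doc = pq(html)
# A CSS selector picks out the img element; attr() reads its attribute
print(doc('.card > a > img').attr('src'))  # prints: cover.jpg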

To follow along, first install the PyQuery package with pip (pip install pyquery), then modify our crawler to look like this:

import requests
from pyquery import PyQuery as pq
url = "https://www.nanrentu.cc/sgtp/"
response = requests.get(url)
if response.status_code == 200:
    with open("result.html".'w',encoding="utf-8") as f:
        f.write(response.text)
    # start parsing
    doc = pq(response.text)
    # Paste in the copied selector to select the corresponding node
    imgElement = doc('body > div:nth-child(5) > div > div > div:nth-child(2) > ul > li:nth-child(3) > a > img')
    # Get the image link
    imgSrc = imgElement.attr('src')
    # Print the image link on the screen
    print(imgSrc)


Step 4: Store the target

How could such a good-looking little brother be left to roam the Internet? The safest place for him is the learning folder on the hard drive. Next, let's get the little brother into the bowl.

Downloading an image is the same process as grabbing an HTML page: use Requests to fetch the data stream and save it locally.

import requests
from pyquery import PyQuery as pq
url = "https://www.nanrentu.cc/sgtp/"
response = requests.get(url)
if response.status_code == 200:
    with open("result.html".'w',encoding="utf-8") as f:
        f.write(response.text)
    doc = pq(response.text)
    imgElement = doc('body > div:nth-child(5) > div > div > div:nth-child(2) > ul > li:nth-child(3) > a > img')
    imgSrc = imgElement.attr('src')
    print(imgSrc)
    # Download image
    imgResponse = requests.get(imgSrc)
    if imgResponse.status_code == 200:
        # Write the file in binary format
        # (the 'Learning file' folder must already exist)
        with open('Learning file/boy.jpg', 'wb') as f:
            f.write(imgResponse.content)


Now let’s take a look at the effect

The four-step crawler

So far, only a dozen or so lines of code complete a small crawler. Isn't that simple? In fact, the basic idea of a crawler is just these four steps, and so-called complex crawlers simply keep evolving on top of them. A crawler's ultimate purpose is to obtain resources (text, pictures, and so on), and every operation revolves around a resource (a code sketch follows this list):

  1. Open the resource
  2. Locate resources
  3. Parsing resources
  4. Download resources
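As a rough sketch (the function and parameter names here are illustrative, not a fixed API), the four steps map onto code like this:

import requests
from pyquery import PyQuery as pq

def crawl(url, selector, fileName):
    # 1. Open the resource
    response = requests.get(url)
    if response.status_code == 200:
        # 2. Locate the resource with a CSS selector
        doc = pq(response.text)
        imgElement = doc(selector)
        # 3. Parse out the piece of data we need
        imgSrc = imgElement.attr('src')
        # 4. Download the resource
        imgResponse = requests.get(imgSrc)
        if imgResponse.status_code == 200:
            with open(fileName, 'wb') as f:
                f.write(imgResponse.content)

Everything we do from here on is a variation on this skeleton.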

More little brothers

Through the steps above we can only get one little brother. Some sisters will say: I could just right-click and save the image, so why bother writing a crawler? Well, what we do next is upgrade the selector so it matches a whole group of elements, and get them all into the bowl.

Refactor the code

To make the code easier to work with later, let's first do a simple refactor to clean things up.

  1. Add entry function
  2. Encapsulate operations on images

The refactored code looks like this:

import requests
from pyquery import PyQuery as pq

def saveImage(imgUrl, name):
    # Download one image and save it to the learning folder
    imgResponse = requests.get(imgUrl)
    fileName = "Learning file/%s.jpg" % name
    if imgResponse.status_code == 200:
        with open(fileName, 'wb') as f:
            f.write(imgResponse.content)

def main():
    baseUrl = "https://www.nanrentu.cc/sgtp/"
    response = requests.get(baseUrl)
    if response.status_code == 200:
        with open("result.html".'w',encoding="utf-8") as f:
            f.write(response.text)
        doc = pq(response.text)
        imgElement = doc('body > div:nth-child(5) > div > div > div:nth-child(2) > ul > li:nth-child(3) > a > img')
        imgSrc = imgElement.attr('src')
        print(imgSrc)
        saveImage(imgSrc,'boy')
        
if __name__ == "__main__":
    main()

Upgrade the selector

Those of you with front-end experience will notice that Chrome's auto-generated selector pins down one specific child element (the li:nth-child(3) part), so it matches only one element among its siblings.

Pick up the mouse, click around with the element inspector in the debugger, and walk through the HTML hierarchy layer by layer, looking for the repeating structure. That repetition is our breakthrough.
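Before touching the real page, a tiny sketch with a made-up HTML fragment shows why the generated selector only ever matches one element, and what loosening it does:

from pyquery import PyQuery as pq

# A made-up fragment with three sibling list items
html = '''
<ul>
  <li><a><img src="1.jpg"></a></li>
  <li><a><img src="2.jpg"></a></li>
  <li><a><img src="3.jpg"></a></li>
</ul>
'''
doc = pq(html)
# Chrome's generated style of selector pins down one specific child
print(doc('li:nth-child(3) > a > img').attr('src'))  # 3.jpg
# Dropping :nth-child makes the selector match every sibling
print(len(doc('li > a > img')))  # 3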

We can see that the images all live inside a ul tag whose class is h-piclist, so we can write the selector .h-piclist > li > a > img, which selects every image element on the page. Then we just iterate over them with a for loop:
import requests
from pyquery import PyQuery as pq

# introduce UUID for image naming
import uuid

def saveImage(imgUrl, name):
    # Download one image and save it to the learning folder
    imgResponse = requests.get(imgUrl)
    fileName = "Learning file/%s.jpg" % name
    if imgResponse.status_code == 200:
        with open(fileName, 'wb') as f:
            f.write(imgResponse.content)

def main():
    baseUrl = "https://www.nanrentu.cc/sgtp/"
    response = requests.get(baseUrl)
    if response.status_code == 200:
        with open("result.html".'w',encoding="utf-8") as f:
            f.write(response.text)
        doc = pq(response.text)
        # Select all target image elements on this page
        imgElements = doc('.h-piclist > li > a > img').items()
        # Walk through the image elements
        for i in imgElements:
            imgSrc = i.attr('src')
            print(imgSrc)
            saveImage(imgSrc,uuid.uuid1().hex)

if __name__ == "__main__":
    main()

Images that cannot be downloaded

As you can see, all of the image links were obtained, but when the program tried to download the images, something odd happened: the image requests got no response at all. Where did it go wrong? All the links were extracted, so the code isn't wrong; each link opens normally in the browser, so the links aren't wrong either. That points to a problem with the network itself, for example:

  1. Slow network
  2. Network fluctuation
  3. The target site has anti-crawling measures

Many factors could be at play here, so let's first try the simplest fix: reconnect. Restart the laptop's Wi-Fi, rejoin the network, and run the program again.

Surprise! The stinky little brothers are all safely on disk.
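Incidentally, rather than reconnecting by hand every time, we can also teach the crawler itself some patience. Here is a minimal sketch of a download helper with a timeout and simple retries (the function name, retry count, and delay are all just illustrative choices):

import time
import requests

def getWithRetry(url, retries=3):
    # A timeout stops a dead connection from hanging the whole crawler;
    # try a few times before giving up
    for attempt in range(retries):
        try:
            return requests.get(url, timeout=10)
        except requests.exceptions.RequestException as e:
            print("Request failed (%s), retrying..." % e)
            time.sleep(2)
    return None

Swapping requests.get for a helper like this inside saveImage would let the crawler ride out brief network hiccups on its own.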

Of course, neither reconnecting nor blind retrying is a silver bullet. We need more techniques to improve our crawler's "acting": the more human-like our crawler appears, the more easily it will retrieve resources.
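As a small taste of what "acting" means (we'll go deeper next time), one common trick is to send the same User-Agent header a real browser sends, since a bare requests call announces itself as a script. A minimal sketch, using an example desktop-Chrome User-Agent string:

import requests

# An example desktop-browser User-Agent string; any real browser's value works
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get("https://www.nanrentu.cc/sgtp/", headers=headers)
print(response.status_code)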

If crawlers had a theme song, it would have to be Xue Zhiqian's "The Actor". In the next tutorial, we will sharpen our "acting skills" further and get even more little brothers.

If you think we have done a good job, please remember to like + follow + comment 🥰🥰🥰 and encourage us to write a better tutorial 💪

Want to learn more exciting hands-on tutorials? Come and visit the Tuque community.