Contents

What is a web crawler?

How does a crawler work?

1. Get the web page link for “Pikachu pictures” on Baidu Images

2. Get the full code of the web page

3. Find the image links in the code

4. Write a general regular expression based on the image links

5. Match all qualifying image links in the code with the regular expression

6. Open the image links one by one and download the images


Hello everyone! I am Big Bad Wolf. Many friends who have learned some Python hope to have a crawler of their own, so today I will share a simple crawler program with you.

Without further ado, let's get started.

 

What is a web crawler?

A web crawler, simply put, is a program that opens a specific web page and extracts (“crawls”) certain information from it. If you think of a web page as a field, a crawler is like an insect living in that field: it crawls through it from top to bottom and eats only certain kinds of food. The analogy is a bit rough, but that is roughly what a web crawler actually does.

If you want to learn more, you can also read my article “Python in a Minute Takes You to the Hidden Web Bugs!”.

 

How does a crawler work?

Here’s an example:

Every web page we see is composed of code, and this code contains all the information on the page. When we open a web page, we can press F12 to see its code. Let's take the Baidu Images search page for “Pikachu” as an example: after pressing F12, you can see the code that makes up the entire page.

 

 

Take a crawler that crawls “Pikachu pictures” as an example. We want to crawl all the Pikachu pictures on this page, so what our crawler needs to do is find the links to the Pikachu pictures in the page's code and download the pictures behind those links.

So the working principle of a crawler is essentially the same as finding and extracting substrings of a specific format from one long string: the page's code is the long string, and the image links are the substrings. If you are interested in this topic, you can read my article “Python Specific Text Extraction in Practice: The First Step Toward Efficient Office Work”.
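To illustrate the idea of extracting specific-format substrings from one long string, here is a minimal, self-contained sketch; the HTML fragment and link values are made up purely for demonstration:

```python
import re

# A made-up fragment standing in for a page's source code
page_code = (
    '<img src="https://example.com/a.jpg">'
    '<p>some text</p>'
    '<img src="https://example.com/b.jpg">'
)

# Extract every value that follows src=" and ends at the next quote
links = re.findall(r'src="(.*?)"', page_code)
print(links)  # ['https://example.com/a.jpg', 'https://example.com/b.jpg']
```

This is exactly the pattern our crawler will use later, just with a different marker string and a real page as input.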

With that in mind, it’s time to write such a crawler.

Python crawlers typically use either the standard library's urllib (urllib2 in Python 2) or the third-party Requests module. urllib2 is more complex to use than Requests, so we will use the Requests module to write our crawler.

 

Take the Baidu Images “Pikachu” page as an example.

According to the working principle of a crawler, our program should do the following steps in sequence:

  1. Get the web page link for “Pikachu pictures” on Baidu Images
  2. Get the full code of the web page
  3. Find the image links in the code
  4. Write a general regular expression based on the image links
  5. Match all qualifying image links in the code with the regular expression
  6. Open the image links one by one and download the images

Next, Big Bad Wolf will walk you through writing this crawler according to the steps above:

 

 

1. Get the web page link for “Pikachu pictures” on Baidu Images

First, we open the Baidu Images home page: image.baidu.com

 

Then we search for the keyword “Pikachu” and get the link:

image.baidu.com/search/inde…

 

Comparing the two links and removing the redundant parts, we can see that Baidu Images keyword-search links share a general form: image.baidu.com/search/inde…

Now the first step, getting the page link for “Pikachu pictures” on Baidu Images, is complete. The next step is to get the full code of the page.
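As a side note, a search keyword has to be URL-encoded before it can appear in a link (especially if it contains non-ASCII characters). Requests can build such a link for us via its params argument; the sketch below only constructs the URL without sending any request:

```python
import requests

# Build the search URL; requests percent-encodes the keyword automatically
req = requests.Request(
    "GET",
    "http://image.baidu.com/search/index",
    params={"tn": "baiduimage", "word": "Pikachu"},
).prepare()
print(req.url)  # http://image.baidu.com/search/index?tn=baiduimage&word=Pikachu
```

Swapping in a different value for "word" gives you the search link for any other keyword in the same general form.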

 

 

2. Get the full code of the web page

At this point, we can open the link using the get() function of the Requests module,

and then obtain the text of the page, that is, all of its code, through the response's text attribute.

import requests

url = "http://image.baidu.com/search/index?tn=baiduimage&word=Pikachu"
urls = requests.get(url)  # open the link
urltext = urls.text  # get the text (code) of the page

 

 

3. Find the image links in the code

In this step, we can first open the page link and press F12 to view all of the page's code, as mentioned at the beginning. Then, since we want to crawl all the JPG images, we can press Ctrl+F to search the code for specific content.

If we search for “jpg” in the page's code, we find entries like this:

 

These links are the content we want to get. If we look at them carefully, we find that they have something in common: each link is preceded by the marker "objURL": and is wrapped in double quotes.

If we take out one of the links, for example

dnptystore.qbox.me/p/chapter/a…

we can open it and see the image.

Therefore, we can tentatively infer that JPG image links in Baidu Images generally appear in the format "objURL":"XXXX".

 

 

4. Write a general regular expression based on the image links

Now that we know the common format for these image links is "objURL":"XXXX", it's time to write a regular expression for that format.

urlre = re.compile('"objURL":"(.*?)"', re.S)
# re.S makes "." in the regular expression also match "\n" newlines
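Here is a quick illustration of what re.S changes, using a made-up sample string: without it, “.” stops at a newline, so a link broken across lines would not be matched.

```python
import re

# Made-up sample where the link spans two lines
sample = '"objURL":"https://example.com/a\n.jpg"'

without_s = re.findall('"objURL":"(.*?)"', sample)
with_s = re.findall('"objURL":"(.*?)"', sample, re.S)

print(without_s)  # [] -- "." cannot cross the newline
print(with_s)     # ['https://example.com/a\n.jpg']
```

Page source often contains line breaks, so compiling the expression with re.S makes the matching more robust.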

If you are not familiar with regular expressions, you can also read my two articles, “Python Regular Expressions Tutorial (Basics)” and “Python Regular Expressions Tutorial (Advanced)”.

 

 

5. Match all qualifying image links in the code with the regular expression

Now that we have written the regular expression for the image links above, it's time to match it against the entire page code and get a list of all the links.

urllist = urlre.findall(urltext)  # get the list of image links; urltext is the entire code of the page

Next, we use a few lines of code to verify that the expression really matched image links, by writing all the matched links to a txt file:

with open("1.txt", "w") as txt:
    for i in urllist:
        txt.write(i + "\n")

Then we can see the matched image links in this file; copy any of them into a browser and it will open.

 

 

6. Open the image links one by one and download the images

Now that we have all the links in the list, it’s time to download the images.

The basic idea is to loop through all the links in the list, open each link and get its content in binary form, create a .jpg file, and write the image data to that file in binary mode.

To avoid downloading too fast, we sleep for three seconds before each download, and cap each link's access time at 5 seconds; if a request takes longer than 5 seconds, we treat that download as failed and move on to the next picture.

As for why the images are opened and written in binary mode: image data is raw bytes, not text, so it has to be read and written as bytes.
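Here is a small illustration of that point; the four-byte “image” below is obviously made up (it mimics the start of a JPEG file). Binary data must go into a file opened with "wb" and be read back with "rb":

```python
# Binary data (like image content) is bytes, not text
fake_image = b"\xff\xd8\xff\xe0"  # made-up stand-in for real image bytes

with open("demo.jpg", "wb") as f:  # "wb": write bytes
    f.write(fake_image)

with open("demo.jpg", "rb") as f:  # "rb": read bytes back
    data = f.read()

print(data == fake_image)  # True
```

In the crawler, requests.get(...).content plays the role of fake_image: it is the raw bytes of the downloaded image.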

The code for downloading the images is as follows; here the number of images to download is set to 3:

i = 0
for urlimg in urllist:
    time.sleep(3)  # sleep for three seconds before each download
    try:
        img = requests.get(urlimg, timeout=5).content  # get the image content in binary form
    except requests.exceptions.RequestException:
        print("Download failed!")  # the link timed out or could not be opened
        continue
    if img:
        with open(str(i) + ".jpg", "wb") as imgs:  # create a new .jpg file and write in binary mode
            print("Downloading image %s: %s" % (str(i + 1), urlimg))
            imgs.write(img)  # write the image
            i += 1
        if i == 3:  # to avoid unlimited downloads, stop after 3 images
            break
    else:
        print("Download failed!")

Now a simple crawler for Baidu Pikachu pictures is complete. You can also change the search keyword and the number of downloads at will, to cultivate a crawler of your own.
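Building on that idea, the whole flow can be wrapped in a function whose keyword and download count are parameters. The function name download_images and its defaults are my own choices for illustration, not part of the original code:

```python
import re
import time

import requests


def download_images(keyword, count=3, delay=3):
    """Download up to `count` images for `keyword` from Baidu Images."""
    resp = requests.get(
        "http://image.baidu.com/search/index",
        params={"tn": "baiduimage", "word": keyword},
    )
    links = re.findall('"objURL":"(.*?)"', resp.text, re.S)
    saved = 0
    for link in links:
        if saved == count:
            break
        time.sleep(delay)  # be polite: pause between downloads
        try:
            img = requests.get(link, timeout=5).content
        except requests.exceptions.RequestException:
            continue  # skip links that fail or time out
        if img:
            with open("%s_%d.jpg" % (keyword, saved), "wb") as f:
                f.write(img)
            saved += 1
    return saved


# Example usage (requires network access):
# download_images("Pikachu", count=3)
```

For example, download_images("Charmander", count=5) would fetch five Charmander pictures under the same assumptions about the page format.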

 

Finally, here is the full source code:

import requests
import re
import time

url = "http://image.baidu.com/search/index?tn=baiduimage&word=Pikachu"

s = requests.session()
s.headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
urltext = s.get(url).content.decode('utf-8')  # get the full text (code) of the page

urlre = re.compile('"objURL":"(.*?)"', re.S)  # compile the regular expression
urllist = urlre.findall(urltext)  # match the image links with the regular expression

with open("1.txt", "w") as txt:  # write the matched links to a file
    for i in urllist:
        txt.write(i + "\n")

i = 0

# Loop through the list and download the images
for urlimg in urllist:
    time.sleep(3)  # sleep for three seconds before each download
    try:
        img = requests.get(urlimg, timeout=5).content  # get the image content in binary form
    except requests.exceptions.RequestException:
        print("Download failed!")  # the link timed out or could not be opened
        continue
    if img:
        with open(str(i) + ".jpg", "wb") as imgs:  # create a new .jpg file and write in binary mode
            print("Downloading image %s: %s" % (str(i + 1), urlimg))
            imgs.write(img)  # write the image
            i += 1
        if i == 5:  # to avoid unlimited downloads, stop after 5 images
            break
    else:
        print("Download failed!")

print("Download complete!")

 

If you found this useful, remember to give it a thumbs up and follow!

You can also follow my WeChat official account “Gray Wolf Hole Master” and reply “Python Notes” to get my Python from-beginner-to-master notes, plus a quick-reference manual of common functions and methods!

Big Bad Wolf looks forward to making progress with you!