This is the third day of my participation in the November Gwen Challenge 2021.

Some time ago, a friend asked me to crawl the HD images from a certain Tieba post.

Here's the story: my friend found a lot of beautiful pictures in a Tieba post and wanted to download the originals as wallpapers. There were too many images in the post to save by hand, so he asked me to write a crawler to batch-download them.

There are only two requirements:

  1. Download the original image
  2. Batch download

Enough talk, let's get started.

1. Analyze the site

The post address my friend provided: tieba.baidu.com/p/651608483… .

By analyzing the URL's composition, we can guess that 6516084831 is the post's ID.

After checking "only view the thread starter" and paging through the post, the link tieba.baidu.com/p/651608483… gains two extra parameters: see_lz=1 means only the original poster's posts are shown, and pn indicates the page number.
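As a sketch, the observed URL pattern can be captured in a small helper (the parameter meanings are inferred from the capture above; `thread_url` is an illustrative name, not from the original code):

```python
# Base URL for a Tieba thread page, as seen in the captured links.
BASE = "https://tieba.baidu.com/p/"

def thread_url(tid, page=1, lz_only=True):
    """Compose a Tieba thread URL.

    see_lz=1 -> show only the original poster's posts ("only view OP")
    pn       -> page number within the thread
    """
    params = "?see_lz=%d&pn=%d" % (1 if lz_only else 0, page)
    return BASE + str(tid) + params

print(thread_url(6516084831, page=2))
# https://tieba.baidu.com/p/6516084831?see_lz=1&pn=2
```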

Open the browser developer tool and switch to Network for packet capture.

The post content turns out to be rendered directly into the HTML page (there is no separate data API). In other words, to get the post's images we only need to parse pages like https://tieba.baidu.com/p/6516084831?see_lz=1&pn=2.

2. Check for anti-crawl mechanisms

Write a simple piece of Python code to test for anti-crawl mechanisms:

import requests

url = "https://tieba.baidu.com/p/6516084831?see_lz=1&pn=1"
r = requests.get(url)
print(r.text)

Testing shows there is no special anti-crawl mechanism: the data can be fetched directly, without even setting a User-Agent.
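Even though no User-Agent check was observed, it costs little to send a browser-like one and fail fast on HTTP errors; a hedged sketch (the UA string is only an example, not anything the site requires, and `fetch` is an illustrative name):

```python
import requests

# A browser-like User-Agent is cheap insurance against future
# anti-crawl changes; this value is just an example.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

def fetch(url):
    r = requests.get(url, headers=HEADERS, timeout=10)
    r.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return r.text
```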

3. Extract data

Since there is no anti-crawl mechanism and the data sits in a static page, we can look at the page source directly and parse out the data.

The images are in img tags with class BDE_Image, and each image link is in the tag's src attribute.

import requests
from bs4 import BeautifulSoup

url = "https://tieba.baidu.com/p/6516084831?see_lz=1&pn=1"
r = requests.get(url)
html = r.text
bsObj = BeautifulSoup(html, "lxml")
imgList = bsObj.find_all("img", attrs={"class": "BDE_Image"})
for img in imgList:
    print(img["src"])

We use BeautifulSoup's find_all function to find all matching img tags, then read each tag's src attribute.

4. Download pictures

Downloading an image is the same as crawling text; the only difference is that the data is binary.

import requests

imgUrl = "http://tiebapic.baidu.com/forum/w%3D580/sign=1ecd59e749df8db1bc2e7c6c3922dddb/0f72eed3fd1f4134d5a75d01321f95cad0c85ead.jpg"
r = requests.get(imgUrl)
content = r.content
with open("image.jpg", "wb") as f:
    f.write(content)

When handling the response, read the binary payload with r.content, and when saving the file, open it with mode "wb".
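For large files, `r.content` holds the whole image in memory at once. A streaming variant using requests' `stream=True` and `iter_content` avoids that (a sketch; `save_image` and `write_chunks` are illustrative names):

```python
import requests

def write_chunks(path, chunks):
    """Write an iterable of byte chunks to path in binary ("wb") mode."""
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)

def save_image(url, path, chunk_size=8192):
    # stream=True defers the body; iter_content yields it piece by piece,
    # so the whole image never sits in memory at once.
    with requests.get(url, stream=True, timeout=10) as r:
        r.raise_for_status()
        write_chunks(path, r.iter_content(chunk_size=chunk_size))
```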

5. Getting the original image link

However, you’ll soon find that the downloaded image is not the original but a slightly scaled-down version (580×326 instead of 1920×1080).

After some digging, I found the download link for the original image on the "view larger image" page:

Thumbnail: tiebapic.baidu.com/forum/w%3D5…

Original: tiebapic.baidu.com/forum/pic/i…

Comparing the two reveals that the original-image link can be built from the thumbnail link with a small change, keeping only the file name:

“tiebapic.baidu.com/forum/pic/i…” + “0f72eed3fd1f4134d5a75d01321f95cad0c85ead.jpg”
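The rewrite described above can be expressed as a small function (a sketch; the full prefix is the one used in the download code later in this post, and `to_original` is an illustrative name):

```python
# Swap the thumbnail prefix for the original-image prefix, keeping
# only the image's file name (the hash at the end of the URL).
ORIG_PREFIX = "http://tiebapic.baidu.com/forum/pic/item/"

def to_original(thumb_url):
    """Map a thumbnail URL to the original-image URL via its file name."""
    name = thumb_url.rsplit("/", 1)[-1]
    return ORIG_PREFIX + name
```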

6. Code consolidation

With the above analysis done, we can put together the batch download of original Tieba images.

Here is the complete source code after tidying up.

import requests
from bs4 import BeautifulSoup
import os

def fetchUrl(url):
    r = requests.get(url)
    return r.text

def parseHtml(html):
    bsObj = BeautifulSoup(html, "lxml")
    imgList = bsObj.find_all("img", attrs={"class": "BDE_Image"})
    return imgList

def getPageNum(url):
    html = fetchUrl(url)
    bsObj = BeautifulSoup(html, "lxml")
    maxPage = bsObj.find("input", attrs={"id": "jumpPage4"})["max-page"]
    print(maxPage)
    return int(maxPage)

def downLoadImage(imgList):
    for img in imgList:
        imgName = img["src"].split("/")[-1]
        imgUrl = "http://tiebapic.baidu.com/forum/pic/item/" + imgName

        if os.path.exists("HD images/" + imgName):
            print("Skip:", imgName)
            continue

        picReq = requests.get(imgUrl)
        saveFile("HD images/", imgName, picReq.content)
        print(imgName)

def saveFile(path, filename, content):
    if not os.path.exists(path):
        os.makedirs(path)

    with open(path + filename, "wb") as f:
        f.write(content)

def run(tid):
    url = "https://tieba.baidu.com/p/%d?see_lz=1&pn=1" % tid
    totalNum = getPageNum(url)
    for page in range(1, totalNum + 1):
        url = "https://tieba.baidu.com/p/%d?see_lz=1&pn=%d" % (tid, page)
        html = fetchUrl(url)
        imgList = parseHtml(html)
        downLoadImage(imgList)

if __name__ == "__main__":
    tid = 6516084831
    run(tid)
    print("over")

If anything in this article is unclear, or if I have explained something incorrectly, please point it out in the comments so we can learn, communicate, and improve together.