This is the third day of my participation in the November Writing Challenge. Check out the details: The Last Writing Challenge of 2021.
Some time ago a buddy of mine asked me to scrape the HD images from a certain Tieba post.
The story: my buddy had found a lot of beautiful pictures in a Tieba thread and wanted to download the originals as wallpapers, but there were far too many to save by hand. So he asked me to write a crawler to batch-download them.
There are only two requirements:
- Download the original image
- Batch download
Without further ado, let's get started.
1. Analyze the site
The post address my buddy provided: tieba.baidu.com/p/651608483… .
By analyzing the URL composition, we can guess that 6516084831 is the id of the post.
After clicking "View thread starter only" and turning the page, the link tieba.baidu.com/p/651608483… gains two extra parameters: see_lz=1 means only the thread starter's posts are shown, and pn is the page number (pn=1 is the first page).
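Putting the two parameters together, the page URLs for a thread can be built like this (a minimal sketch; the helper name is my own):

```python
def build_page_url(tid, page):
    # see_lz=1 shows only the thread starter's posts; pn is the page number
    return "https://tieba.baidu.com/p/%d?see_lz=1&pn=%d" % (tid, page)

print(build_page_url(6516084831, 2))
# → https://tieba.baidu.com/p/6516084831?see_lz=1&pn=2
```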
Open the browser developer tools and switch to the Network tab to capture requests.
We find that the post content is rendered directly into the HTML page (there is no separate data API). In other words, we only need to parse pages like https://tieba.baidu.com/p/6516084831?see_lz=1&pn=2 to get the post's image data.
2. Check the anti-crawl mechanism
First, write a simple piece of Python to test for anti-crawl measures:
import requests
url = "https://tieba.baidu.com/p/6516084831?see_lz=1&pn=1"
r = requests.get(url)
print(r.text)
After testing, it turns out there is no special anti-crawl mechanism: requests can fetch the data directly, without even setting a User-Agent.
3. Extract data
Since there is no anti-crawl mechanism and the data sits in a static web page, we can look at the page source directly and parse out the data.
The images are in img tags with the class BDE_Image, and each image link is in the tag's src attribute.
import requests
from bs4 import BeautifulSoup

url = "https://tieba.baidu.com/p/6516084831?see_lz=1&pn=1"
r = requests.get(url)
html = r.text
bsObj = BeautifulSoup(html, "lxml")
imgList = bsObj.find_all("img", attrs={"class": "BDE_Image"})

for img in imgList:
    print(img["src"])
We use BeautifulSoup's find_all function to find all matching img tags, then read each tag's src attribute.
4. Download pictures
Downloading an image works the same way as crawling text; the only difference is that the data is binary.
import requests

imgUrl = "http://tiebapic.baidu.com/forum/w%3D580/sign=1ecd59e749df8db1bc2e7c6c3922dddb/0f72eed3fd1f4134d5a75d01321f95cad0c85ead.jpg"
r = requests.get(imgUrl)
content = r.content

with open("image.jpg", "wb") as f:
    f.write(content)
When handling the response, use r.content (the raw bytes) instead of r.text, and when saving the file, open it in "wb" (write binary) mode.
5. Get the original image link
However, you'll soon find that the downloaded image is not the original, but a slightly scaled-down version (580×326 instead of 1920×1080).
After some exploring, I found the download link for the original image on the "view larger image" page.
Thumbnail: tiebapic.baidu.com/forum/w%3D5…
Original: tiebapic.baidu.com/forum/pic/i…
Comparing the two, the original-image link can be built with a slight change: take the filename from the thumbnail link and append it to the original-image prefix.
"tiebapic.baidu.com/forum/pic/i…" + "0f72eed3fd1f4134d5a75d01321f95cad0c85ead.jpg"
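That filename swap can be sketched as a small helper (the function name is my own; the prefix comes from the full code later in this post):

```python
def to_original_url(thumb_src):
    # The filename (last path segment) is shared between the thumbnail
    # and the original, so grab it and attach it to the original-image prefix.
    filename = thumb_src.split("/")[-1]
    return "http://tiebapic.baidu.com/forum/pic/item/" + filename

thumb = ("http://tiebapic.baidu.com/forum/w%3D580/"
         "sign=1ecd59e749df8db1bc2e7c6c3922dddb/"
         "0f72eed3fd1f4134d5a75d01321f95cad0c85ead.jpg")
print(to_original_url(thumb))
# → http://tiebapic.baidu.com/forum/pic/item/0f72eed3fd1f4134d5a75d01321f95cad0c85ead.jpg
```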
6. Code consolidation
With the above analysis, we can now implement batch downloading of original images from a Tieba post.
The following is the complete source code after finishing.
import requests
from bs4 import BeautifulSoup
import os

def fetchUrl(url):
    r = requests.get(url)
    return r.text

def parseHtml(html):
    bsObj = BeautifulSoup(html, "lxml")
    imgList = bsObj.find_all("img", attrs={"class": "BDE_Image"})
    return imgList

def getPageNum(url):
    html = fetchUrl(url)
    bsObj = BeautifulSoup(html, "lxml")
    maxPage = bsObj.find("input", attrs={"id": "jumpPage4"})["max-page"]
    print(maxPage)
    return int(maxPage)

def downLoadImage(imgList):
    for img in imgList:
        imgName = img["src"].split("/")[-1]
        imgUrl = "http://tiebapic.baidu.com/forum/pic/item/" + imgName
        if os.path.exists("HD images/" + imgName):
            print("Skip,", imgName)
            continue
        picReq = requests.get(imgUrl)
        saveFile("HD images/", imgName, picReq.content)
        print(imgName)

def saveFile(path, filename, content):
    if not os.path.exists(path):
        os.makedirs(path)
    with open(path + filename, "wb") as f:
        f.write(content)

def run(tid):
    url = "https://tieba.baidu.com/p/%d?see_lz=1&pn=1" % tid
    totalNum = getPageNum(url)
    for page in range(1, totalNum + 1):
        url = "https://tieba.baidu.com/p/%d?see_lz=1&pn=%d" % (tid, page)
        html = fetchUrl(url)
        imgList = parseHtml(html)
        downLoadImage(imgList)

if __name__ == "__main__":
    tid = 6516084831
    run(tid)
    print("over")
If anything in this article is unclear, or if I have explained something incorrectly, please point it out in the comments so we can learn, communicate, and make progress together.