
1. Introduction

To be honest, this is probably the hardest crawler I have written so far. Everything I had crawled before was static content, and although this site's anti-crawling measures are only low-level, they still gave a novice like me plenty of trouble. The main ideas and the anti-crawling workarounds come from this blog post: mp.weixin.qq.com/s/wyS-OP04K…

2. Getting around the anti-crawling measures

Anime House link: www.dmzj.com/

2.1 Basic Ideas

My initial idea was the same as in an earlier post of mine (Python crawler: automatically downloading cosplay pictures), and it turned out to be roughly right: first crawl the links to all chapters of the comic, then follow each chapter link to collect the links to all of its images, and finally request each image link and save the file. Only once I actually started did I discover how many traps this site has laid.
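Concretely, the whole thing boils down to three steps built on the helper functions defined in the full source at the end of this post (askUrl, getLinks, getImgs, download); here is a rough outline of that flow, with the Yao Shen Ji info page and a local save directory plugged in as example values:

# Rough outline of the three-step flow; the helpers are defined in the full source below
html = askUrl("https://www.dmzj.com/info/yaoshenji.html")    # 1. fetch the comic's info page
chapter_links, chapter_titles = getLinks(html)               # 2. collect every chapter link
for link, title in zip(chapter_links, chapter_titles):
    img_links = getImgs(link)                                # 3. collect that chapter's image links
    os.makedirs("D:/comics/" + title, exist_ok=True)
    download(link, img_links, "D:/comics/" + title)          #    ...then download and save each image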

2.2 Crawling chapter links

This step is not much of a problem: simply analyze the structure of the page, and the elements can be located directly with XPath.

def getLinks(html):
    chapter_link = []
    chapter_title = []
    parse = parsel.Selector(html)
    links = parse.xpath('//div[@class="tab-content tab-content-selected zj_list_con autoHeight"]/ul[@class="list_con_li autoHeight"]/li/a/@href').getall()
    titles = parse.xpath('//div[@class="tab-content tab-content-selected zj_list_con autoHeight"]/ul[@class="list_con_li autoHeight"]/li/a/span[@class="list_con_zj"]/text()').getall()
    # The page lists chapters newest-first, so insert at the front to reverse the order
    for link in links:
        chapter_link.insert(0, link)
    for title in titles:
        chapter_title.insert(0, title)
    return chapter_link, chapter_title

Just note that the chapters on the page are listed in reverse (newest first), so we need to flip the order as we collect them; that is why the code inserts each item at position 0 instead of simply appending.
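For what it's worth, the same reversal could also be written with reversed() instead of repeated insert(0, ...), e.g. inside getLinks:

    # Equivalent alternative: collect in page order, then reverse once at the end
    chapter_link = list(reversed(links))
    chapter_title = list(reversed(titles))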

2.3 Crawling the Comic Links

This is where things start to hurt. It is hard mainly because these tricks were completely new to me.

2.3.1 The page source cannot be viewed

First I tried to view the page source, but right-clicking did nothing at all, which seemed to mean the source simply could not be read.

Later I found that F12 still works, so the site does leave a way in. A quick search showed that this is only the lowest level of anti-crawling: just prefix the page address with view-source: and the browser will display the source directly.
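For example, with the info page used later in this post, typing the following into the address bar shows the page source directly:

view-source:https://www.dmzj.com/info/yaoshenji.html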

2.3.2 Dynamic Loading

But after opening the source and searching for the image links, I found nothing there at all. Some searching told me this is called dynamic loading, which mainly falls into two categories:

1. External loading
2. Internal loading

Checking this page shows it is the external-loading case. In fact I had already noticed some data in a script tag that looked familiar, although I did not dig into it at the time (it is the red-boxed area in the screenshot). The image links look like images.dmzj.com/img/chapter… Since we can at least piece an image link together from this data, let's extract it first.

def getImgs(link):
    pic_url = []
    response = requests.get(link, headers=headers)
    html = BeautifulSoup(response.text, 'lxml')
    script_info = html.script
    # The 4-digit and 5-digit numbers are the two directory segments of the image URL
    one = re.findall(r"\|(\d{4})\|", str(script_info))[0]
    two = re.findall(r"\|(\d{5})\|", str(script_info))[0]
    # The 13- or 14-digit numbers identify the individual images
    threes = re.findall(r'\d{13,14}', str(script_info))

2.3.3 The images are out of order

I saw these numbers at the time but could not figure out how they determined the order, and I puzzled over it for a long time. The original blogger, though, just tried something and it actually worked, which I have to admire. The last segment of each image ID is either 13 or 14 digits long, so how do you sort them? Simply comparing them as numbers would group all of the 13-digit IDs together, ahead of (or behind) every 14-digit one, which is not the real page order. The trick is to append a 0 to each 13-digit ID before comparing, so that every ID is compared at the same scale. In hindsight it is obvious; I just did not think of it.
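A tiny demonstration of the idea, using made-up IDs rather than anything pulled from the site:

# Hypothetical 13/14-digit IDs, purely for illustration
ids = ['14916899561339', '1491689956313', '14916899565197']
print(sorted(ids, key=int))     # plain numeric sort: the 13-digit ID jumps to the front
padded = [x + '0' if len(x) == 13 else x for x in ids]
print(sorted(padded, key=int))  # after padding: the 13-digit ID falls back into its place

The corresponding part of getImgs then looks like this: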

    for i, three in enumerate(threes):
        if len(three) == 13:
            threes[i] = three + '0'          # pad 13-digit IDs to 14 digits for sorting
    threes = sorted(threes, key=lambda x: int(x))
    for three in threes:
        if three[-1] == '0':
            # remove the padding before building the real URL
            pic_url.append("https://images.dmzj.com/img/chapterpic/" + one + "/" + two + "/" + three[:-1] + ".jpg")
        else:
            pic_url.append("https://images.dmzj.com/img/chapterpic/" + one + "/" + two + "/" + three + ".jpg")

2.3.4 Downloading the images returns 403

Honestly, without that blogger's post I would have been stuck here for a very long time. The key observation is that if you open an image link from inside the site, the image displays normally, but if you refresh it, i.e. request the same link directly from outside, the image will not load (as shown in the screenshot). This is a classic Referer-based anti-crawling technique.

The Referer header can be understood as recording where a request came from: you first open the chapter URL and then open the image link from that page, so when the image is requested, the chapter URL is carried along in the Referer header. Here is a simple analogy:

If your house has only one door, anyone entering obviously has to come through that door. If someone instead cuts a hole in the wall and climbs in without using the door, something is clearly wrong.

The fix is just as simple: tell the server that we did come in through the door it provided, by setting the Referer header ourselves.

# Request the image with the chapter URL as the Referer
    headers1 = {
        'Referer': "chapter link",   # the URL of the chapter page the image belongs to
    }
    response = requests.get(link, headers=headers1)

Then we can access the image normally.
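As a rough way to check this behaviour yourself (both URLs below are placeholders following the patterns above, not real links), a request without the header should be rejected while the same request with it should go through:

import requests

img_url = "https://images.dmzj.com/img/chapterpic/1234/56789/14916899561339.jpg"  # placeholder
chapter_url = "the chapter page URL"                                               # placeholder

print(requests.get(img_url).status_code)                                    # expected: 403
print(requests.get(img_url, headers={'Referer': chapter_url}).status_code)  # expected: 200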

2.4 Downloading Images

This part is easy. I have talked before about the two ways to download files; here we will just use open().

# Download the images of one chapter
def download(url, links, dir_name):
    headers1 = {
        'Referer': url,      # the chapter URL, to pass the Referer check
    }
    i = 1
    for link in links:
        pic_name = '%03d.jpg' % i
        new_dir_name = os.path.join(dir_name, pic_name)
        response = requests.get(link, headers=headers1)
        with open(new_dir_name, 'wb') as f:
            f.write(response.content)
            print(pic_name + " downloaded")
        i += 1
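The other approach mentioned above is to stream the response instead of reading it in one go, which is what the original blogger's code at the end of this post does; a minimal sketch of that variant, reusing the variables from the function above:

# Streamed download: write the image in fixed-size chunks instead of one big read
from contextlib import closing

with closing(requests.get(link, headers=headers1, stream=True)) as response:
    if response.status_code == 200:
        with open(new_dir_name, 'wb') as f:
            for chunk in response.iter_content(chunk_size=1024):
                f.write(chunk)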

3. Effect demonstration

4. The source code

My code:

import requests
import parsel
import pypinyin
from bs4 import BeautifulSoup
import re
import os
import time

# Disguise the browser. Set the request header
headers={"User-Agent": "Mozilla / 5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",}
Return the requested information for the page
def askUrl(url) :
    response=requests.get(url,headers=headers)
    html=response.content.decode('utf-8')
    return html

# Get all chapter links and chapter names
def getLinks(html):
    chapter_link = []
    chapter_title = []
    parse = parsel.Selector(html)
    links = parse.xpath('//div[@class="tab-content tab-content-selected zj_list_con autoHeight"]/ul[@class="list_con_li autoHeight"]/li/a/@href').getall()
    titles = parse.xpath('//div[@class="tab-content tab-content-selected zj_list_con autoHeight"]/ul[@class="list_con_li autoHeight"]/li/a/span[@class="list_con_zj"]/text()').getall()
    # The page lists chapters newest-first, so insert at the front to reverse the order
    for link in links:
        chapter_link.insert(0, link)
    for title in titles:
        chapter_title.insert(0, title)
    return chapter_link, chapter_title

# Get the links to all images in a chapter
def getImgs(link):
    pic_url = []
    response = requests.get(link, headers=headers)
    html = BeautifulSoup(response.text, 'lxml')
    script_info = html.script
    # The 4-digit and 5-digit numbers are the two directory segments of the image URL
    one = re.findall(r"\|(\d{4})\|", str(script_info))[0]
    two = re.findall(r"\|(\d{5})\|", str(script_info))[0]
    # The 13- or 14-digit numbers identify the individual images
    threes = re.findall(r'\d{13,14}', str(script_info))
    for i, three in enumerate(threes):
        if len(three) == 13:
            threes[i] = three + '0'          # pad 13-digit IDs to 14 digits for sorting
    threes = sorted(threes, key=lambda x: int(x))
    for three in threes:
        if three[-1] == '0':
            # remove the padding before building the real URL
            pic_url.append("https://images.dmzj.com/img/chapterpic/" + one + "/" + two + "/" + three[:-1] + ".jpg")
        else:
            pic_url.append("https://images.dmzj.com/img/chapterpic/" + one + "/" + two + "/" + three + ".jpg")
    return pic_url

# Download manga
def download(url, links, dir_name):
    headers1 = {
        'Referer': url,      # the chapter URL, to pass the Referer check
    }
    i = 1
    for link in links:
        pic_name = '%03d.jpg' % i
        new_dir_name = os.path.join(dir_name, pic_name)
        response = requests.get(link, headers=headers1)
        with open(new_dir_name, 'wb') as f:
            f.write(response.content)
            print(pic_name + " downloaded")
        i += 1

# main method
def main():
    manhua = input("Please enter the name of the comic you want to download: ")
    dir_name = r'D:\comics'
    if not os.path.exists(dir_name + '/' + manhua):
        os.makedirs(dir_name + '/' + manhua)
    dir_name = dir_name + '/' + manhua
    # dmzj builds its info-page URL from the pinyin of the title, e.g. yaoshenji
    pinyins = pypinyin.pinyin(manhua, style=pypinyin.NORMAL)
    name = ''
    for pinyin in pinyins:
        name = name + ''.join(pinyin)
    url = "https://www.dmzj.com/info/" + name + ".html"
    html = askUrl(url)
    links, names = getLinks(html)
    for i, link in enumerate(links):
        if not os.path.exists(dir_name + '/' + str(names[i])):
            os.makedirs(dir_name + '/' + str(names[i]))
        print("Downloading: " + names[i])
        imglinks = getImgs(link)
        download(link, imglinks, dir_name + '/' + str(names[i]))
        print(names[i] + " downloaded")
        print("Taking a short break before the next chapter.")
        time.sleep(10)
        print("-" * 40)
    print(manhua + " has been fully downloaded.")

# Main function entry
if __name__ == '__main__':
    main()

The original blogger's code:

import requests
import os
import re
from bs4 import BeautifulSoup
from contextlib import closing
from tqdm import tqdm
import time

# Create the save directory
save_dir = 'Fairy Tale'
if save_dir not in os.listdir('./'):
    os.mkdir(save_dir)

target_url = "https://www.dmzj.com/info/yaoshenji.html"

# Get the chapter links and chapter names
r = requests.get(url = target_url)
bs = BeautifulSoup(r.text, 'lxml')
list_con_li = bs.find('ul', class_="list_con_li")
cartoon_list = list_con_li.find_all('a')
chapter_names = []
chapter_urls = []
for cartoon in cartoon_list:
    href = cartoon.get('href')
    name = cartoon.text
    chapter_names.insert(0, name)
    chapter_urls.insert(0, href)

# Download manga
for i, url in enumerate(tqdm(chapter_urls)):
    download_header = {
        'Referer': url
    }
    name = chapter_names[i]
    # Remove '.' from the chapter name
    while '.' in name:
        name = name.replace('.', '')
    chapter_save_dir = os.path.join(save_dir, name)
    if name not in os.listdir(save_dir):
        os.mkdir(chapter_save_dir)
        r = requests.get(url = url)
        html = BeautifulSoup(r.text, 'lxml')
        script_info = html.script
        pics = re.findall(r'\d{13,14}', str(script_info))
        for j, pic in enumerate(pics):
            if len(pic) == 13:
                pics[j] = pic + '0'
        pics = sorted(pics, key=lambda x:int(x))
        chapterpic_hou = re.findall(r'\|(\d{5})\|', str(script_info))[0]
        chapterpic_qian = re.findall(r'\|(\d{4})\|', str(script_info))[0]
        for idx, pic in enumerate(pics):
            if pic[-1] == '0':
                url = 'https://images.dmzj.com/img/chapterpic/' + chapterpic_qian + '/' + chapterpic_hou + '/' + pic[:-1] + '.jpg'
            else:
                url = 'https://images.dmzj.com/img/chapterpic/' + chapterpic_qian + '/' + chapterpic_hou + '/' + pic + '.jpg'
            pic_name = '%03d.jpg' % (idx + 1)
            pic_save_path = os.path.join(chapter_save_dir, pic_name)
            with closing(requests.get(url, headers = download_header, stream = True)) as response:
                chunk_size = 1024
                content_size = int(response.headers['content-length'])
                if response.status_code == 200:
                    with open(pic_save_path, "wb") as file:
                        for data in response.iter_content(chunk_size=chunk_size):
                            file.write(data)
                else:
                    print('Link exception')
        time.sleep(10)

If you found this helpful, you can follow my official account; a new creator really needs your support. If you have any questions, feel free to message me directly.
