I spent a few days learning the basics of penetration testing and decided to relax over the weekend. I recently heard that Tian Can Tu Dou has a new novel out, called Yuan Zun. Back when I was a student I loved reading Tian Can Tu Dou's novels, so today let's walk through, step by step, how to crawl the full text of Yuan Zun.

First of all, we need to choose a website to crawl. I used the "book home" site (shujy.com); other sites work on the same principle.

Required libraries

We will use the requests, re, and time libraries. re and time are part of Python's standard library, so we only need to install requests.

pip install requests

The coding process

Visit the book home page and find the URL for Yuan Zun: www.shujy.com/5200/9683/.

Request it with requests, then print out the HTML.

import requests

url = 'https://www.shujy.com/5200/9683/'
response = requests.get(url)
html = response.text

print(html)
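One caveat before moving on: if the printed HTML shows garbled Chinese, requests may have guessed the wrong character set. A common fix (not part of the original code, just a sketch) is to let requests sniff the encoding from the page body:

response = requests.get(url)
# apparent_encoding detects the charset from the content itself
response.encoding = response.apparent_encoding
html = response.text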

The printed output looks like this:

In the HTML we find the section containing the book's title and author.

We use regular expressions to extract them:

import re

# The page exposes the book name and author in <meta> tags; the exact
# property names below are an assumption based on the site's markup.
title = re.findall(r'<meta property="og:novel:book_name" content="(.*?)"/>', html)[0]
author = re.findall(r'<meta property="og:novel:author" content="(.*?)"/>', html)[0]

Next we need to pull out the link for each chapter of the novel. We use the browser's F12 developer tools:

  1. Click the arrow in the upper-left corner.
  2. Click the element we want to locate; since we need each chapter's link, click "Body Chapter 1".
  3. The developer tools then highlight that element's position in the HTML.

When we look at the links' locations, we find they all sit inside the <div id="list"> tag, and each link is placed in an href attribute. So the plan is: take the content out of that div tag, then take all the href links out of it and put them in a list.

# Grab everything inside the <div id="list"> tag, then pull every chapter
# link (href) and chapter title out of it. The <a ...> pattern is an
# assumption based on the page's markup.
dl = re.findall(r'<div id="list">(.*?)</div>', html, re.S)[0]
chapter_info_list = re.findall(r'<a href="(.*?)">(.*?)</a>', dl)

Now that we have a list of all the chapters, we need to fetch each one. We join the home page URL with each chapter's link.

chapter_url = url + '/' + chapter_url
chapter_url = chapter_url.replace(' ', '')  # strip any stray spaces
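As an aside, plain string concatenation like this can produce doubled slashes. Python's standard urllib.parse.urljoin (not used in the original) handles the joining more robustly; here chapter_link stands for the href taken from chapter_info_list:

from urllib.parse import urljoin

# urljoin resolves the relative chapter link against the base URL
chapter_url = urljoin(url, chapter_link)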

Then we fetch the HTML for the chapter content, again through the requests library:

chapter_response = requests.get(chapter_url)
chapter_html = chapter_response.text

In the same way, we find that the body text lives inside the <div id="content"> tag.

Let’s take all the body content out of the div tag

# Get the first page of the chapter text out of the <div id="content"> tag
chapter_content = re.findall(r'<div id="content">(.*?)</div>', chapter_html, re.S)[0]

Let’s print out what we’ve got

We find some "<br />" tags and "&emsp;" space entities in the output. These are artifacts we don't want to see, so we filter them out with the replace function.

chapter_content = chapter_content.replace('&emsp;&emsp;', '')  # indentation entities
chapter_content = chapter_content.replace('<br />', '')

Let’s look at the filtered content

Something is still wrong: why is every line of text followed by an empty line? Let's step through and inspect chapter_content in the debugger.

It turns out there are still some stray carriage-return characters; we remove the '\r\n\r' sequences so that only the newline '\n' is kept.

chapter_content = chapter_content.replace('\r\n\r', '')
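A quick way to see what this replacement does is to compare the repr of a made-up sample line before and after:

s = 'line one\r\n\r\nline two'         # made-up sample with the blank-line pattern
print(repr(s))                         # 'line one\r\n\r\nline two'
print(repr(s.replace('\r\n\r', '')))   # 'line one\nline two' - only the newline remains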

So we’ve stripped all the text of the page, but when we get to the end of the page, we realize that each chapter may not be just one page, but two, three, or more pages. How can we remove all this uncertainty?

We notice that the body of each page states the total number of pages in the chapter and provides a link to the next page, so we use that clue to finish the job.

First we need to take out the total number of pages and the link to the next page.

# Pull the next-page link and the "第x页/共y页" ("page x of y") marker out of
# the chapter body. The tag and attribute layout in this pattern is an
# assumption reconstructed from the page's markup.
chapter_url, current_page, total_pages = re.findall(
    r'<p style="color:red;">.*?href="(.*?)".*?第(.*?)页/共(.*?)页</p>',
    chapter_content, re.S)[0]
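For illustration, here is what that regex returns on a hypothetical footer (the filename and page counts are made up):

sample = '<p style="color:red;"><a href="9683_2.html">下一页</a>第1页/共3页</p>'
print(re.findall(r'<p style="color:red;">.*?href="(.*?)".*?第(.*?)页/共(.*?)页</p>',
                 sample, re.S)[0])
# ('9683_2.html', '1', '3')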

We then use a for loop to fetch the remaining pages in the same manner. Without going into more detail, here is the code.

for i in range(1, int(total_pages)):
    chapter_url, current_page, total_pages = re.findall(
        r'<p style="color:red;">.*?href="(.*?)".*?第(.*?)页/共(.*?)页</p>',
        chapter_content, re.S)[0]
    chapter_url = url + '/' + chapter_url
    chapter_url = chapter_url.replace(' ', '')
    chapter_response = requests.get(chapter_url)
    chapter_html = chapter_response.text
    chapter_content = re.findall(r'<div id="content">(.*?)</div>',
                                 chapter_html, re.S)[0]
    chapter_content = chapter_content.replace('&emsp;&emsp;', '')
    chapter_content = chapter_content.replace('<br />', '')
    chapter_content = chapter_content.replace('\r\n\r', '')
    f.write('\n')
    f.write(chapter_content)

Finally, we just need to wrap all of this in a file-write operation so that each chunk of text is written out as we read it.

with open('%s.txt' % title, 'w') as f:
    f.write(title)
    f.write('\n')
    f.write(author)
    f.write('\n')
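To make the overall structure clearer, here is a minimal sketch of how the pieces above fit together; it assumes chapter_info_list holds the (link, name) pairs extracted earlier:

with open('%s.txt' % title, 'w') as f:
    f.write(title)
    f.write('\n')
    f.write(author)
    f.write('\n')
    # loop over every chapter collected from <div id="list">
    for chapter_link, chapter_name in chapter_info_list:
        chapter_url = (url + '/' + chapter_link).replace(' ', '')
        chapter_html = requests.get(chapter_url).text
        chapter_content = re.findall(r'<div id="content">(.*?)</div>',
                                     chapter_html, re.S)[0]
        # ... clean the content and follow the next-page links as shown above ...
        f.write('\n')
        f.write(chapter_content)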

Handling the "list index out of range" error

It looked like everything was finished, but when I actually ran the download, this error kept appearing in different chapters.

One run it might fail at chapter 4, the next at chapter 10; it isn't fixed. I looked it up, and there are two ways this error can occur (see the small demo after the list):

  1. list[index], where index is out of range.
  2. list[0] on an empty list that has no elements.
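Both cases are easy to reproduce:

items = [1, 2, 3]
try:
    items[5]         # index beyond the end of the list
except IndexError as e:
    print(e)         # list index out of range

empty = []
try:
    empty[0]         # indexing an empty list
except IndexError as e:
    print(e)         # list index out of range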

Even knowing these causes, neither one explains why the error shows up in random chapters. I still have not found the root cause; if an expert reading this knows, please point it out.

However, I did find a way to work around it: since the failure hits random chapters, as soon as I detect the error I request the URL again and run the regex again. So I wrote a function like this.

import re
import time
import requests

# findall with retry: if the match fails, re-request the page and try again
def find(pattern, string, url):
    try:
        chapter_content = re.findall(pattern, string, re.S)[0]
        return chapter_content
    except Exception as e:
        print(e)
        time.sleep(1)                  # brief pause before retrying
        chapter_response = requests.get(url)
        chapter_html = chapter_response.text
        print(chapter_html)            # debug: dump the re-fetched page
        print(url)                     # debug: log which URL had to be retried
        return find(pattern, chapter_html, url)
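With this helper in place, the extraction calls above can go through find() instead of calling re.findall() directly, for example:

# hypothetical call site - same pattern as before, now with retry built in
chapter_content = find(r'<div id="content">(.*?)</div>', chapter_html, chapter_url)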

It worked. I have kept it running, and I have downloaded more than 100 chapters so far.

The only drawback now is that it is a bit slow. Forgive this rookie, I have not yet learned multi-threading or multi-processing; we will improve that next time.

Source code

To get the source code, the usual rule applies: follow the public account "rookie xiaobai learning to share" and reply "novel crawler source" in a private message.

Ok, today’s content is up, if you think rookie xiaobai’s share is helpful to you, please help to click a like, watch and follow, we will see you next time ~