I spent several days learning the basics of penetration testing and let myself relax over the weekend. I recently heard that Tian Can Tu Dou ("Heavenly Silkworm Potato") has a new novel out, called Yuan Zun. Back when I was a student I loved reading his novels, so today let's look at how to crawl the complete text of Yuan Zun, step by step.
First we need to pick a site to crawl. I used the book home website (shujy.com); the process is much the same for other sites.
Required libraries
We use the requests, re, and time libraries. re and time are part of Python's standard library, so only requests needs installing:
pip install requests
The coding process
Visit the book home page for "Yuan Zun" to find its URL: www.shujy.com/5200/9683/.
Request it with requests, then print the HTML:
import requests
import re    # standard library, used below for the regex extraction
import time  # standard library, used later for the retry delay

url = 'https://www.shujy.com/5200/9683/'
response = requests.get(url)
html = response.text
print(html)
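One caveat the original doesn't mention: Chinese sites are often served as GBK, and if requests guesses the encoding wrong the printed HTML comes out as mojibake. A minimal sketch of a fix, using requests' own charset detection:

# Optional: re-detect the encoding from the page body before reading .text
# (only needed if the printout looks garbled)
response.encoding = response.apparent_encoding
html = response.text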
The printed output looks like this:
In the HTML we find the section containing the novel's title and author.
We use regular expressions to extract them:
# NOTE: the literal HTML tags inside the original patterns were swallowed
# when the article was rendered. The patterns below are a reconstruction
# that assumes the title sits in an <h1> tag and the author in a <meta>
# tag -- check the actual page source and adjust if needed.
title = re.findall(r'<h1>(.*?)</h1>', html)[0]
author = re.findall(r'<meta name="author" content="(.*?)"', html)[0]
Next we need to extract the link to each chapter of the novel. Using the browser's F12 developer tools:
- Click the arrow in the upper-left corner.
- Click the element we want to locate. Since we need each chapter's link, click "Body Chapter 1".
- The developer tools then show where that element sits in the HTML.
Looking at where the links live, we find they all sit inside the <div id="list"> tag, with each link in an href attribute. So the plan is: take the contents of that div tag, then pull every href out of it into a list.
# Reconstructed for the same reason as above: assumes the chapter list sits
# inside <div id="list">...</div> and each entry is <a href="...">title</a>.
dl = re.findall(r'<div id="list">(.*?)</div>', html, re.S)[0]
chapter_info_list = re.findall(r'<a href="(.*?)">(.*?)</a>', dl)
Now that we have the list of chapters, we need to fetch each one. We splice the home-page URL together with each chapter's link:
chapter_url = url + '/' + chapter_url
# The arguments of the original replace() were garbled in rendering; it most
# likely strips stray spaces out of the spliced URL.
chapter_url = chapter_url.replace(' ', '')
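As a side note (not in the original post), the standard library's urljoin does this splicing more robustly. A small sketch, where chapter_link is a hypothetical name for one href taken from chapter_info_list:

from urllib.parse import urljoin

# urljoin resolves relative hrefs against the index URL and handles the
# slashes correctly, so no manual replace() cleanup is needed.
chapter_url = urljoin(url, chapter_link)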
Then we fetch the chapter content's HTML, again via the requests library:
chapter_response = requests.get(chapter_url)
chapter_html = chapter_response.text
In the same way, we find that the body text sits inside the <div id="content"> tag.
Let's take all the body content out of that div tag.
Get the first page of text
# Reconstructed pattern, assuming the body sits in <div id="content">...</div>
chapter_content = re.findall(r'<div id="content">(.*?)</div>', chapter_html, re.S)[0]
Let’s print out what we’ve got
We find some "<br />" tags and "&emsp;" entities mixed in. We don't want these, so we filter them out with replace().
# The first argument of the original call rendered as a blank; given the
# text above, it was almost certainly the '&emsp;' entity.
chapter_content = chapter_content.replace('&emsp;', '')
chapter_content = chapter_content.replace('<br />', '')
Let’s look at the filtered content
Something is still off: why is every line of text followed by a blank line? Stepping through chapter_content in the debugger shows leftover carriage-return characters, so we strip those and keep only the newline "\n":
chapter_content = chapter_content.replace('\r\n\r', '')
With that we've extracted all the text on the page. But reaching the end of a page, we realize a chapter isn't necessarily a single page; it may span two, three, or more. How do we handle that uncertainty?
Each page's body states the total number of pages in the chapter and carries a link to the next page, so we use that as our clue.
First, extract the total page count and the next-page link:
# Reconstructed pager pattern: 第N页/共M页 reads "page N of M in total". The
# exact surrounding markup was garbled in translation; this assumes the pager
# is a <p> whose <a href="..."> carries the next-page link.
chapter_url, current_page, total_pages = re.findall(
    r'href="(.*?)">.*?第(.*?)页/共(.*?)页', chapter_content, re.S)[0]
We then use a for loop to fetch the remaining pages the same way. No need to go over it again; here is the code:
for i in range(1, int(total_pages)):
    # Same reconstructed pager pattern as above
    chapter_url, current_page, total_pages = re.findall(
        r'href="(.*?)">.*?第(.*?)页/共(.*?)页', chapter_content, re.S)[0]
    chapter_url = url + '/' + chapter_url
    chapter_url = chapter_url.replace(' ', '')
    chapter_response = requests.get(chapter_url)
    chapter_html = chapter_response.text
    chapter_content = re.findall(r'<div id="content">(.*?)</div>',
                                 chapter_html, re.S)[0]
    chapter_content = chapter_content.replace('&emsp;', '')
    chapter_content = chapter_content.replace('<br />', '')
    chapter_content = chapter_content.replace('\r\n\r', '')
    f.write('\n')
    f.write(chapter_content)
Finally, we wrap everything in a file-write context on the outside so each chunk of text we read gets written out:
# encoding='utf-8' is not in the original, but it avoids UnicodeEncodeError
# on systems whose default encoding cannot represent Chinese text.
with open('%s.txt' % title, 'w', encoding='utf-8') as f:
    f.write(title)
    f.write('\n')
    f.write(author)
    f.write('\n')
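To see how the pieces nest, here is a minimal sketch of the whole flow under the assumptions above (chapter_info_list holds (href, title) tuples as reconstructed earlier; the multi-page loop is omitted for brevity):

with open('%s.txt' % title, 'w', encoding='utf-8') as f:
    f.write(title + '\n')
    f.write(author + '\n')
    for chapter_link, chapter_title in chapter_info_list:
        # splice and clean the chapter URL, then fetch and extract the body
        chapter_url = (url + '/' + chapter_link).replace(' ', '')
        chapter_html = requests.get(chapter_url).text
        chapter_content = re.findall(r'<div id="content">(.*?)</div>',
                                     chapter_html, re.S)[0]
        for junk in ('&emsp;', '<br />', '\r\n\r'):
            chapter_content = chapter_content.replace(junk, '')
        f.write('\n' + chapter_title + '\n')
        f.write(chapter_content)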
Handling the "list index out of range" error
It looked like everything was done, but when I actually ran the download, I kept hitting this error in different chapters.
One run it might fail at chapter 4, the next at chapter 10; it isn't fixed. Looking it up, there are two ways this error can occur (a short demo follows the list):
- list[index] with an index beyond the end of the list;
- list[0] on an empty list that has no elements.
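In our case it is the second one. A one-line demo of how an incomplete response triggers it:

import re

# If the page came back incomplete, findall() matches nothing and returns [],
# so indexing [0] raises "IndexError: list index out of range".
re.findall(r'<div id="content">(.*?)</div>', '<html></html>')[0]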
Even knowing the causes, neither of them should produce errors in random chapters. I still haven't found the real reason; if any expert reading this knows, please point it out.
However, I did find a way around it: since the failure hits a random chapter, the moment I detect the error I request the URL again and re-run the regex check. So I wrote a function that looks like this.
def find(pattern, string, url):
    try:
        chapter_content = re.findall(pattern, string, re.S)[0]
        return chapter_content
    except Exception as e:
        # On failure (usually an empty findall result), wait a moment,
        # re-fetch the page, and retry recursively.
        print(e)
        time.sleep(1)
        chapter_response = requests.get(url)
        chapter_html = chapter_response.text
        print(chapter_html)
        print(url)
        i = find(pattern, chapter_html, url)
        return i
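Call it as a drop-in replacement for the bare re.findall(...)[0] lines above, for example:

# Hypothetical call site: same content pattern as before, now self-healing
chapter_content = find(r'<div id="content">(.*?)</div>', chapter_html, chapter_url)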
It worked. I've kept it running and have downloaded over 100 chapters so far.
The only drawback now is that it's a bit slow. Forgive this rookie, who hasn't yet learned multi-threading or multi-processing; we'll improve that next time.
Source code
To get the source code, the usual rules apply: follow the public account "rookie xiaobai learning to share" and reply "novel crawler source" in a private message.
That's all for today. If you found rookie xiaobai's share helpful, please give it a like, a look, and a follow. See you next time!