A crawler case study – Those Things in the Ming Dynasty
From March to early April, I was absorbed in a novel, Those Things in the Ming Dynasty. It took me almost a whole month to finish, and I grew more and more fascinated as I went; such is the charm of the book.
The story begins with Zhu Yuanzhang begging for food; through relentless wars he defeats one rival after another and founds the Ming Dynasty. Later come the rebellion of Zhu Di, Prince of Yan, and Qi Jiguang's campaigns against the Japanese pirates, and then Zhang Juzheng, the most famous chief minister of the Cabinet. In the end the Qing army enters the pass, and the Ming Dynasty falls in Chongzhen's hands, or, to be precise, was doomed to fall in his hands. As the book writes:
Alas, Ming Dynasty, your allotted span is done
The book is about far more than history: power, hope, pain, integrity, loneliness, cruelty, evil, patience, persistence, truth, loyalty... it is all there. At the end of the book, the author closes with a poem, excerpted here:
When cobwebs ruthlessly seal off my hearth
When the lingering smoke of ashes sighs over the sorrow of poverty
I still stubbornly smooth the ashes of disappointment
Write with beautiful snowflakes: Believe in the future
When my purple grapes have turned into the dew of late autumn
When my flowers nestle in another's embrace
I still stubbornly use the frosty dry vine
Write on the desolate earth: Believe in the future
I want to point with my fingers at the waves surging toward the horizon
I want to hold up with my palm the sea that carries the sun
And with that warm and lovely pen-shaft swaying in the dawn
Write in a child's hand: Believe in the future
The reason I believe so firmly in the future
Is that I believe in the eyes of the people of the future
She has eyelashes that brush away the dust of history
She has pupils that see through the pages of time
No matter whether people, looking at our decaying flesh,
At the melancholy of losing our way, at the pain of failure,
Offer moved and burning tears and deep sympathy,
Or disdainful smiles and bitter mockery,
I firmly believe that to our spines,
To those countless explorations, wrong turns, failures and successes,
They will surely give a warm, objective and fair assessment
Yes, I'm anxiously awaiting their evaluation
My friend, believe firmly in the future
Believe in indomitable efforts
Believe in youth that conquers death
Believe in the future and love life
by: The moon
Crawling the novel
The data source
This article shows how to use Python to crawl the chapters of this book from a novel website.
Homepage: https://www.kanunu8.com/
Page to crawl: https://www.kanunu8.com/files/chinese/201102/1777.html
Content to crawl
1. Chapter titles
2. Text content of chapters
Take the first chapter as an example: clicking "Chapter One: Childhood" on the index page takes us to the body of the first chapter.
Crawl results
Let's look at the data we finally crawled: a folder named "Those Things in the Ming Dynasty" is generated in the local directory, containing the 33 chapters we crawled, including the preface and the overture.
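To verify the result we can simply count the files in that folder. A quick check, assuming the folder name used in the code later in the article:
import os

# count the crawled chapter files; expect 33: the preface, the overture, and chapters 1 to 31
files = os.listdir('Those Things in the Ming Dynasty')
print(len(files))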
Crawler libraries
The libraries used in this crawler (a short note on the thread pool follows the imports):
from multiprocessing.dummy import Pool  # thread pool (despite the package name), speeds up the crawl
import requests                         # send requests and retrieve web page data
import re                               # regex module, to parse the data
import os                               # handle files and directories
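A note on multiprocessing.dummy: despite the package name, its Pool is a thread pool that exposes the same interface as multiprocessing.Pool. Threads are a good fit here because downloading pages is I/O-bound. A minimal sketch of how such a pool is used; the fetch helper and the pool size of 4 are placeholders for illustration:
from multiprocessing.dummy import Pool  # thread pool with the multiprocessing.Pool interface
import requests

def fetch(url):
    # download one page and return its address together with its size in bytes
    return url, len(requests.get(url).content)

# two of the chapter pages listed later in the article
urls = [
    'https://www.kanunu8.com/files/chinese/201102/1777/40607.html',
    'https://www.kanunu8.com/files/chinese/201102/1777/40608.html',
]

with Pool(4) as pool:                    # 4 worker threads
    for url, size in pool.map(fetch, urls):
        print(url, size)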
Crawl flow chart
Web page analysis
Analyze the URL pattern of the pages:
# main page: https://www.kanunu8.com/files/chinese/201102/1777.html
# preface: https://www.kanunu8.com/files/chinese/201102/1777/40607.html
# overture: https://www.kanunu8.com/files/chinese/201102/1777/40608.html
# the first chapter: https://www.kanunu8.com/files/chinese/201102/1777/40609.html
# chapter 31: https://www.kanunu8.com/files/chinese/201102/1777/40639.html
We found the pattern: each chapter page is distinguished by its own URL suffix. Looking at the source of the index page reveals these suffixes.
With the suffix of each chapter's URL in hand, the sketch below shows how a full chapter address is built from it.
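A full chapter address is the directory part of the main page plus the suffix found in the source. A quick sketch, using the first-chapter suffix 1777/40609.html taken from the list above:
start_url = 'https://www.kanunu8.com/files/chinese/201102/1777.html'
suffix = '1777/40609.html'   # suffix of the first chapter, as it appears in the index-page source

chapter_url = start_url.split('1777')[0] + suffix
print(chapter_url)           # https://www.kanunu8.com/files/chinese/201102/1777/40609.html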
The source code
The regex is not well written, so the list of addresses it produces has to be sliced again. Here is the source code, excluding the main function (a sketch of a possible main function follows the listing):
import requests
import re
import os
from multiprocessing.dummy import Pool
start_url = 'https://www.kanunu8.com/files/chinese/201102/1777.html'
def get_source(url):
    """
    Function description: get the source of a web page, either the index page or a chapter page

    Parameters:
        url: the address of the index page or of a chapter page

    Return value:
        the source code of the page
    """
    response = requests.get(url=url)
    result = response.content.decode('gbk')  # the site is served in GBK encoding
    return result  # return the raw page source
def get_toc(source_code):
    """
    Function description: parse the URL of every chapter out of the index page,
    so that the text of each chapter can be crawled afterwards

    Parameters:
        source_code: the source of the index page

    Return value:
        list of chapter-page URLs: [url1, url2, ......, url33]
    """
    toc_url_list = []
    toc_url = re.findall('<a href="(.*?)">', source_code, re.S)  # grab the href of every link on the index page
    for url in toc_url:
        # !!! Note how each chapter address is built: the href is relative (e.g. 1777/40609.html),
        # so it is appended to the directory part of the index-page URL
        toc_url_list.append(start_url.split('1777')[0] + url)
    return toc_url_list[1:34]  # the regex is too loose, so slice out only the 33 chapter URLs
def get_article(article_code):
    """
    Function description: given the source of a chapter page, extract the chapter name and the body text

    Parameters:
        article_code: the source of a chapter page

    Return value:
        the chapter name
        the body text
    """
    # the chapter title sits inside a <font color="#dc143c"> tag
    chapter_name = re.search('color="#dc143c">(.*?)</font>', article_code, re.S).group(1)
    # the body text sits between <p> tags
    article_text = re.search('<p>(.*?)</p>', article_code, re.S).group(1)
    article_text = article_text.replace('<br />', '')  # strip the <br /> tags left in the text
    return chapter_name, article_text
def save(chapter, article):
    """
    Function description: save the body text of a chapter under its chapter name

    Parameters:
        chapter: the name of the chapter
        article: the body text of the chapter

    Return value: None
    """
    os.makedirs('Those Things in the Ming Dynasty', exist_ok=True)  # create the output folder if needed
    with open(os.path.join('Those Things in the Ming Dynasty', chapter + '.txt'), 'w', encoding='utf-8') as f:
        f.write(article)
def query_article(url):
    """
    Function description: given a chapter URL, fetch the chapter name and body text and save them

    Parameters:
        url: the address of a chapter page

    Return value: None
    """
    article_code = get_source(url)  # fetch the page source
    chapter, article_text = get_article(article_code)  # extract the chapter name and body text from the source
    save(chapter, article_text)  # save the text
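The main function itself is not shown in the article. A minimal sketch of what it could look like, assuming the functions above; the __main__ guard and the pool size of 8 are my own choices:
if __name__ == '__main__':
    # fetch the index page and parse the 33 chapter URLs out of it
    toc_url_list = get_toc(get_source(start_url))

    # crawl the chapters concurrently with a thread pool;
    # query_article() fetches, parses and saves one chapter
    with Pool(8) as pool:
        pool.map(query_article, toc_url_list)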
Regex slicing problem
Parsing the index-page source gives the following result:
start_url = 'https://www.kanunu8.com/files/chinese/201102/1777.html'
response = requests.get(url=start_url)
res = response.content.decode('gbk')

toc_url_list = []
toc_url = re.findall('<a href="(.*?)">', res, re.S)  # the same loose regex as in get_toc()
print(toc_url)

for url in toc_url:
    # !!! Note how each chapter address is built from the directory part of the index-page URL
    toc_url_list.append(start_url.split('1777')[0] + url)
toc_url_list
The valid URLs after slicing:
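The output itself is not reproduced here, but the slice should contain the 33 full chapter addresses, presumably of this form (only the first, second and last entries are spelled out; they correspond to the preface, overture and chapter-31 URLs listed earlier):
toc_url_list[1:34]
# ['https://www.kanunu8.com/files/chinese/201102/1777/40607.html',
#  'https://www.kanunu8.com/files/chinese/201102/1777/40608.html',
#  ...
#  'https://www.kanunu8.com/files/chinese/201102/1777/40639.html']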