The crawler case – Those Things in the Ming Dynasty

From March to early April, I spent almost a whole month reading a novel, Those Things in the Ming Dynasty, and became more and more fascinated by it. That is the charm of the book.

The story begins with Zhu Yuanzhang begging for food; through relentless wars he defeated his rivals and established the Ming Dynasty. Later, Zhu Di, the Prince of Yan, rebelled, and Qi Jiguang fought off the Japanese pirates. Then Zhang Juzheng, the most famous chief grand secretary of the Ming cabinet, appeared. Finally, the Qing army entered the pass and the Ming Dynasty fell in Chongzhen's hands, or, to be accurate, was doomed to fall in his hands. As the book puts it:

Alas, the Ming Dynasty, its days were done

The book is not only about history; it is also about power, hope, pain, integrity, loneliness, cruelty, evil, patience, persistence, truth, loyalty... it covers it all. At the end of the book, the author includes a poem, excerpted here:

When cobwebs ruthlessly seal up my hearth

When the dying smoke of ashes sighs over the sorrow of poverty

I still stubbornly smooth out the ashes of disappointment

And write with beautiful snowflakes: Believe in the future



When my purple grapes have turned into the dew of late autumn

When my flowers lie nestled in someone else's embrace

I still stubbornly take the frost-covered withered vine

And write on the desolate earth: Believe in the future



I want to use my fingers, those waves rushing toward the horizon

I want to use my palm, that sea which holds up the sun

And, swaying the warm and lovely pen that is the light of dawn

Write in a child's hand: Believe in the future



The reason I firmly believe in the future

Is that I believe in the eyes of the people of the future

She has eyelashes that brush away the dust of history

She has pupils that see through the pages of time

No matter how people regard our rotting flesh

The melancholy of losing our way, the pain of failure

Whether they offer moved tears and deep sympathy

Or disdainful smiles and bitter sarcasm

I firmly believe that people will give our spines

Those endless quests, missteps, failures and successes

A warm, objective and unbiased assessment



Yes, I'm anxiously awaiting their evaluation

My friend, believe firmly in the future

Believe in indomitable efforts

Believe in youth that conquers death

Believe in the future, and love life



by: Dangnian Mingyue, the author's pen name (literally "the bright moon of those years")


Crawling the novel

The data source

This article shows how to use Python to crawl the chapters of this book from the following website.

Homepage: https://www.kanunu8.com/

Link to crawl: https://www.kanunu8.com/files/chinese/201102/1777/1777.html


Content to crawl

1. Chapter titles

2. Text content of chapters

Take the first chapter as an example: clicking "Chapter One: Childhood" on the main page opens the body text of that chapter.
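Before writing the full crawler, it helps to confirm that a chapter page can be fetched and decoded. Below is a minimal check, using the chapter-one URL listed in the page-analysis section further down and assuming the site serves GBK-encoded pages (as the decoding step in the crawler code suggests):

import requests

chapter_url = 'https://www.kanunu8.com/files/chinese/201102/1777/40609.html'   # chapter one
response = requests.get(chapter_url)
html = response.content.decode('gbk')   # the site's pages are GBK-encoded
print(html[:200])                       # preview the start of the page source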



Crawl results

Let's look at the data that was finally crawled. A folder named "Those Things in the Ming Dynasty" is generated in the local directory; it contains the 33 chapters we crawled, including the preface and the overture.
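To double-check the result, the saved files can be listed from the output folder. This is a small sketch, assuming the folder name used by the save() function later in the article:

import os

folder = 'Those Things in the Ming Dynasty'   # folder created by save()
files = sorted(os.listdir(folder))
print(len(files), 'chapter files saved')      # expected: 33
for name in files[:5]:                        # show the first few chapter files
    print(name)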


Libraries used for the crawl

The libraries used in this crawler:

from multiprocessing.dummy import Pool   # "dummy" multiprocessing, i.e. a thread pool, to speed up the crawl
import requests   # send requests and retrieve web page data
import re         # regex module, used to parse the data
import os         # the os module handles files and directories


Crawl flow chart


Page analysis

Analyze the URL pattern of the pages:

# main page:  https://www.kanunu8.com/files/chinese/201102/1777.html

# preface:    https://www.kanunu8.com/files/chinese/201102/1777/40607.html
# overture:   https://www.kanunu8.com/files/chinese/201102/1777/40608.html
# chapter 1:  https://www.kanunu8.com/files/chinese/201102/1777/40609.html
# chapter 31: https://www.kanunu8.com/files/chinese/201102/1777/40639.html


A pattern emerges: each chapter page is distinguished by its own URL suffix. Looking at the source of the main page, we can find these URLs:



The URL suffix of each chapter can be found in the page source above.
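As a quick illustration of how a full chapter address is built from such a suffix (the suffix value below assumes the links appear as 1777/40609.html, which is what the address construction in the code below implies):

start_url = 'https://www.kanunu8.com/files/chinese/201102/1777.html'
suffix = '1777/40609.html'          # example suffix for chapter one

base = start_url.split('1777')[0]   # 'https://www.kanunu8.com/files/chinese/201102/'
print(base + suffix)                # https://www.kanunu8.com/files/chinese/201102/1777/40609.html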

The source code

The regular expression is not written precisely, so the resulting address list has to be sliced. The main function is omitted; the rest of the source code follows:

import requests
import re
import os
from multiprocessing.dummy import Pool


start_url = 'https://www.kanunu8.com/files/chinese/201102/1777.html'


def get_source(url):
    """
    Fetch the content of a web page, either the main page or a chapter page.

    Parameters:
        url: address of the main page or of a chapter page
    Returns:
        the page source as text
    """
    response = requests.get(url=url)
    result = response.content.decode('gbk')   # the site serves GBK-encoded pages
    return result  # return the raw page source



def get_toc(source_code):
    """
    Parse the URL of every chapter out of the main page, to be crawled later.

    Parameters:
        source_code: source of the main page
    Returns:
        list of chapter-page URLs: [url1, url2, ..., url33]
    """
    toc_url_list = []
    # a loose pattern that captures every link target on the page (any expression that
    # grabs the chapter hrefs works here); because it over-matches, the result is sliced below
    toc_url = re.findall('<a href="(.*?)">', source_code, re.S)
    for url in toc_url:
        toc_url_list.append(start_url.split('1777')[0] + url)   # !!! note how each chapter address is built
    return toc_url_list[1:34]   # the regex is loose, so the list is sliced to keep only the 33 chapters





def get_article(article_code):
    """
    Extract the chapter name and the body text from the source of a chapter page.

    Parameters:
        article_code: source of a chapter page
    Returns:
        the chapter name and the body text
    """
    # the chapter title sits inside a red <font color="#dc143c"> ... </font> tag (closing tag assumed);
    # group(1) returns the first captured match
    chapter_name = re.search('color="#dc143c">(.*?)</font>', article_code, re.S).group(1)
    # the whole body sits between the <p> tags
    article_text = re.search('<p>(.*?)</p>', article_code, re.S).group(1)
    article_text = article_text.replace('<br />', ' ')   # replace the <br /> line-break tags with spaces
    return chapter_name, article_text

    

    

def save(chapter, article):
    """
    Save a chapter to a text file named after the chapter.

    Parameters:
        chapter: the chapter name
        article: the chapter body text
    Returns:
        None
    """
    folder = 'Those Things in the Ming Dynasty'
    os.makedirs(folder, exist_ok=True)   # create the output folder if it does not exist yet
    with open(os.path.join(folder, chapter + '.txt'), 'w', encoding='utf-8') as f:
        f.write(article)

        

        

def query_article(url):
    """
    Given a chapter URL, fetch the chapter name and body text and save them.

    Parameters:
        url: the URL of a chapter page
    Returns:
        None
    """
    article_code = get_source(url)  # fetch the page source
    chapter, article_text = get_article(article_code)  # parse the chapter name and body out of the source
    save(chapter, article_text)  # save the text

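The main block that wires these functions together is omitted above. A minimal sketch of it, assuming the function names defined in this article and an arbitrary pool size of 8 threads, could look like this:

if __name__ == '__main__':
    home_source = get_source(start_url)     # fetch the table-of-contents page
    url_list = get_toc(home_source)         # build the list of 33 chapter URLs
    with Pool(8) as pool:                   # thread pool from multiprocessing.dummy
        pool.map(query_article, url_list)   # fetch and save every chapter concurrently

Because multiprocessing.dummy wraps the threading module, Pool here is a pool of threads rather than processes, which is well suited to an I/O-bound crawl like this one.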

Regex slicing problem

Parsing the home page source gives the following result:

url = 'https://www.kanunu8.com/files/chinese/201102/1777.html'
response = requests.get(url=url)
res = response.content.decode('gbk')

toc_url_list = []
# the same loose pattern as in get_toc(): it captures every link target on the page
toc_url = re.findall('<a href="(.*?)">', res, re.S)
print(toc_url)
for url in toc_url:
    toc_url_list.append(start_url.split('1777')[0] + url)   # !!! note how each chapter address is built

toc_url_list



The valid URLs after slicing:
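As a quick illustration of the slice used in get_toc() (assuming the loose link pattern above, so that entries 1 through 33 of the over-matched list are the chapter pages):

valid_urls = toc_url_list[1:34]   # keep the 33 chapter URLs: preface, overture and chapters 1-31
print(len(valid_urls))            # expected: 33
print(valid_urls[0])              # .../1777/40607.html, the preface
print(valid_urls[-1])             # .../1777/40639.html, chapter 31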