I am participating in the "Spring Festival Creative Submission Contest". See: Spring Festival Creative Submission Contest

1. The origin

On Juejin (Nuggets), everyone has a profile slogan on the platform. Mine has always been:

I have two hobbies: one is traditional culture, the other is high technology.

Yes, I have been exploring how to activate traditional culture with new technology and nourish new technology with traditional culture.

Sure enough, this time I set out to write a TensorFlow-based program that generates Spring Festival couplets automatically. So far, the attempt has worked.

The input is a manually entered upper line; the output is the lower line generated automatically by the machine.

```
Input:  <start> … <end>  [2, 61, 27, 26, 43, 4, 20, 78, 3]
Output: <start> … <end>  [2, 167, 108, 23, 9, 4, 3]
Input:  <start> … <end>  [2, 92, 90, 290, 30, 8, 3]
Output: <start> … <end>  [2, 63, 137, 183, 302, 101, 3]
Input:  <start> … <end>  [2, 126, 312, 17, 4, 26, 3]
Output: <start> … <end>  [2, 1651, 744, 4, 140, 7, 3]
Input:  <start> … <end>  [2, 4, 5, 183, 60, 7, 71, 45, 3]
Output: <start> … <end>  [2, 92, 90, 27, 4, 36, 99, 5, 3]
Input:  <start> … <end>  [2, 48, 8, 164, 76, 4, 5, 197, 50, 3]
Output: <start> … <end>  [2, 6, 28, 167, 58, 17, 33, 15, 113, 3]
```
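The ID sequences above are what the couplet text looks like after tokenization: every sequence begins with 2 and ends with 3, which suggests those are the `<start>` and `<end>` marker IDs. As a purely illustrative sketch (the real vocabulary is built from the training corpus; the words and word-to-ID mapping below are made up), a line could be encoded like this:

```python
# Toy illustration of turning a couplet line into an ID sequence.
# The marker IDs 2 and 3 match the sequences above; the word IDs
# here are hypothetical, chosen only for demonstration.
vocab = {'<start>': 2, '<end>': 3, 'spring': 61, 'comes': 27, 'joy': 26}

def encode(words):
    # Frame the sentence with the start/end marker IDs
    return [vocab['<start>']] + [vocab[w] for w in words] + [vocab['<end>']]

print(encode(['spring', 'comes', 'joy']))  # [2, 61, 27, 26, 3]
```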

2. Implementation

What powers artificial intelligence is training on large amounts of data, and the training data alone is hard to come by.

GitHub – wb14123/couplet-dataset: Dataset for couplets, a database of 700,000 couplets.

However, I was not satisfied, because I wanted Spring Festival couplets specifically, not generic couplets.

Although generic couplets include Spring Festival couplets, true Spring Festival couplets are filled with the flavor of festivity and peace.

So I found a Spring Festival couplet website, www.duiduilian.com/chunlian/, whose content is of good quality, and wrote a crawler to collect the data.

2.1 Analyzing the Page Elements

Open the URL and press F12 to analyze the page.

Analysis shows that the main content we care about sits inside a `<div class="content_zw">` element, and each couplet is wrapped in a `<p>` tag, with the upper and lower lines separated by a comma.

This is a very standard example of data crawling material, perhaps designed for teaching purposes.

To extract the Spring Festival couplets, we only need to download the page's HTML, take out the `content_zw` body, group the text by `<p>` tags, split each couplet into its upper and lower lines at the comma ",", and finally write the results to a file.

Let's get to work.

2.2 Retrieving Text

First, load the URL and extract the body we care about.

```python
import requests
import re

# Simulate a browser sending an HTTP request
response = requests.get(url)
# Set the encoding (the site is GBK-encoded)
response.encoding = 'gbk'
# Get the whole page's HTML
html = response.text
# Extract the text we care about, based on the tag analysis
allText = re.findall(r'<div class="content_zw">.*?</div>', html, re.S)[0]
print(url + "\n allText:" + allText)
```

One interesting thing to note: `re` is Python's regular-expression library, and the `re.findall` above finds every substring of the HTML that matches the shape `<div class="content_zw">…</div>`. Since there might be more than one match, `findall` returns a list; here there is only one, so taking the first element with `[0]` gives us what we want.
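A quick note on the `re.S` flag used above: without it, `.` does not match newline characters, so a pattern that has to span a multi-line `<div>` finds nothing. A minimal demonstration:

```python
import re

# A div whose contents span several lines
html = '<div class="content_zw">\n<p>line one</p>\n</div>'

# Without re.S, '.' stops at newlines, so nothing matches
print(re.findall(r'<div class="content_zw">.*?</div>', html))        # []

# With re.S (DOTALL), '.' also matches '\n', so the whole div is captured
print(re.findall(r'<div class="content_zw">.*?</div>', html, re.S))
```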

So we have the following:

```html
<div class="content_zw">
  <p>Spring comes to the eye, beaming with joy</p>
  <p>Spring is shining, blessing is long in</p>
  <p>Spring and jingming, fu fu is abundant</p>
  <p>Spring falls to the earth, blessing is full of human</p>
</div>
```

2.3 Breaking Out the Couplets

From the target HTML text, select the couplet.

```python
text = '<div class="content_zw"><p>Spring comes to the eye, beaming with joy</p><p>Spring is shining, blessing is long in</p></div>'
couplets = re.findall(r'<p>(.*?)</p>', text)
print(couplets)
# ['Spring comes to the eye, beaming with joy', 'Spring is shining, blessing is long in']
for couplet in couplets:
    cs = couplet.split(",")
    print("Upper line:", cs[0], ", lower line:", cs[1])
```

It still uses `re.findall`. This method is very common in crawlers: to crawl data is to take the full text and then tear out the parts you are interested in. How to tear is expressed through the pattern rules of `re`.

`re.findall(r'<p>(.*?)</p>', text)` selects, from the messy text, whatever matches the `<p>…</p>` shape, and returns only the part matched by the parentheses. It is worth emphasizing that only the contents of `()` are returned.

Here’s an example:

```python
text = "<p>Spring is coming, beaming with joy</p>"

text_r1 = re.findall(r'<p>(.*?)</p>', text)[0]
print(text_r1)  # Spring is coming, beaming with joy

text_r2 = re.findall(r'<p>.*?</p>', text)[0]
print(text_r2)  # <p>Spring is coming, beaming with joy</p>
```

Look at the two examples above to understand the difference between parentheses and no parentheses.

What we get, `couplets`, is actually an array: ['Spring comes to the eye, beaming with joy', 'Spring is shining, blessing is long in']. We then loop through the array and split each couplet into its upper and lower lines with split(","). With the upper and lower lines of each Spring Festival couplet in hand, you can do whatever you want with them.
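A small robustness tweak worth considering (my addition, not in the original code): pass `maxsplit=1` to `split`, so a couplet whose text happens to contain an extra comma still yields exactly two halves, and malformed entries can be skipped instead of crashing.

```python
couplets = ['Spring comes to the eye, beaming with joy',
            'Spring is shining, blessing is long in']

pairs = []
for couplet in couplets:
    # maxsplit=1 keeps exactly two halves even if the text itself
    # contains another comma; entries without any comma are skipped
    cs = couplet.split(",", 1)
    if len(cs) == 2:
        pairs.append((cs[0].strip(), cs[1].strip()))
print(pairs)
```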

2.4 Handling Pagination

So far, this handles only a single page's URL.

In reality, though, there are many sibling pages.

We can enter the first page's URL manually, but the other pages have to be discovered automatically.

Here’s an analysis of the HTML code for the paging section:

Inside `content_zw`, the pagination block looks like this:

```html
<div id="pages">
  <a class="a1" href="/chunlian/4zi.html">Previous page</a>
  <span>1</span>
  <a href="/chunlian/4zi_2.html">2</a>
  <a href="/chunlian/4zi_3.html">3</a>
  ...
  <a href="/chunlian/4zi_8.html">8</a>
  <a class="a1" href="/chunlian/4zi_2.html">Next page</a>
</div>
```

The `href` inside each `<a>` tag is a URL relative to the site root; let's try to fetch them.

```python
# Get the pagination HTML
pages = re.findall(r'<div id="pages">.*?</div>', allText, re.S)[0]
page_list = re.findall(r'href="/chunlian/(.*?)">(.*?)<', pages)
page_urls = []
for page in page_list:
    if page[1] != 'Next page':
        page_url = "https://www.duiduilian.com/chunlian/%s" % page[0]
        page_urls.append(page_url)
# page_urls now holds all the page links
```

I'm sure you can understand most of this code from the earlier explanations. It is similar to extracting the couplets from the `<p>` tags, except that the pattern now contains two `()` groups. Let's print the matched page_list:

```
page_list: [('4zi.html', 'Previous page'), ('4zi_2.html', '2'), ('4zi_3.html', '3'), ('4zi_4.html', '4'), ('4zi_5.html', '5'), ('4zi_6.html', '6'), ('4zi_7.html', '7'), ('4zi_8.html', '8'), ('4zi_2.html', 'Next page')]
```

In other words, `re.findall(r'href="/chunlian/(.*?)">(.*?)<', pages)` captures two pieces per match: the file name after `/chunlian/` inside the `href`, and the link text between `>` and `<`.
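To see the two capture groups in isolation, here is a minimal check against a single link:

```python
import re

# A single pagination link, as seen in the HTML above
pages = '<a href="/chunlian/4zi_2.html">2</a>'
# Group 1 captures the file name, group 2 captures the link text;
# findall returns them together as a tuple
matches = re.findall(r'href="/chunlian/(.*?)">(.*?)<', pages)
print(matches)  # [('4zi_2.html', '2')]
```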

Except for the "Next page" link, the matches give exactly the full addresses of pages 1-8, so we get all of them.

2.5 Complete Code

OK, pagination is done too. Now that we have all the links, and know how to tear the couplet sentences out of each one, here is the complete code.

```python
import requests
import re

def getContent(url):
    response = requests.get(url)
    response.encoding = 'gbk'
    html = response.text
    allText = re.findall(r'<div class="content_zw">.*?</div>', html, re.S)[0]
    return allText

def getPageUrl(allText):
    pages = re.findall(r'<div id="pages">.*?</div>', allText, re.S)[0]
    page_list = re.findall(r'href="/chunlian/(.*?)">(.*?)<', pages)
    page_urls = []
    for page in page_list:
        if page[1] != 'Next page':
            page_url = "https://www.duiduilian.com/chunlian/%s" % page[0]
            print("page_url:", page_url)
            page_urls.append(page_url)
    return page_urls

def do(url, file_name):
    c_text = getContent(url)
    pages = getPageUrl(c_text)
    f = open(file_name, 'w')
    for page_url in pages:
        page_text = getContent(page_url)
        page_couplets = re.findall(r'<p>(.*?)</p>', page_text)
        f.write('\n'.join(page_couplets) + '\n')
    f.close()

url = 'https://www.duiduilian.com/chunlian/4zi.html'
file_name = 'blog4.txt'
do(url, file_name)
```

I deliberately omitted the comments, because I want to point out that this feature takes only about 30 lines of code.

Eventually, it stores the retrieved data in a file named blog4.txt.

This handles four-character Spring Festival couplets; five-, six-, and seven-character ones can be done the same way.

3. Ideal versus reality

After reading the above, you can go eat, because you have mastered the skill, at least under greenhouse conditions.

After dinner, I'll tell you that the 30-line implementation above is, in fact, the ideal situation.

There are actually a lot of exceptions.

For example, in the case below, the `<p>` tags in the couplet body contain both what we want and what we do not want, so they certainly cannot be used as-is.

In the next case, the couplet text has no `<p>` tags at all, so you have to handle it differently.
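For pages like that, one possible fallback (assuming, hypothetically, that the couplets are separated by `<br>` tags instead of being wrapped in `<p>` tags; the actual layout would need to be checked) is to take the inner text of the div and split on `<br>`:

```python
import re

# Hypothetical page fragment where couplets are separated by <br>
# tags rather than wrapped in <p> tags (assumed layout, for
# illustration only)
text = '<div class="content_zw">Upper one, lower one<br>Upper two, lower two<br></div>'
inner = re.findall(r'<div class="content_zw">(.*?)</div>', text, re.S)[0]
# Split on <br> (with or without a closing slash) and drop empties
couplets = [c.strip() for c in re.split(r'<br\s*/?>', inner) if c.strip()]
print(couplets)  # ['Upper one, lower one', 'Upper two, lower two']
```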

In the next one, when there are too many pages, an ellipsis appears: some page numbers have no full link, so some data cannot be reached.

And finally, in the last one, the pagination is presented as a list, so the same approach will not work.

Right? There is a difference between a tutorial and the real thing.

A tutorial should be as clean as possible, explaining a knowledge point by the shortest path, with as little interference as possible.

Real-world practice should be as realistic as possible, designing a feature with the most comprehensive considerations; the more exceptions handled, the better.

This article is a tutorial covering some one-sided knowledge, mainly intended as popular science. I hope you will forgive its shortcomings.

In the next chapter, let's talk about how to implement the training and prediction of Spring Festival couplets in about 300 lines of code.