Preparing the environment

  • Local environment: Windows 10 Professional
  • Operating system: 64-bit
  • Python version: Python 3.8
  • IDE: PyCharm 2020.2

Step-by-step analysis and code implementation

Let's import the libraries we need before we start:

```python
import requests
from lxml import etree
```

Open the target website and fetch its page source for analysis

![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/f97f8d788f0c45df95f8bc14597a14c7)

Three things matter here: the URL of the page, the title of each article, and the article content. We put the URL into the code and request the page:

```python
url = requests.get('https://www.chnlib.com/zuowenku/')
html = url.content.decode()  # decode the response bytes; with no argument the default is "utf-8"
print(html)
```
  1. The result

    This prints the full source code of the page, which confirms that we reached it successfully.
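A note on the decode step above: `response.content` is raw bytes, and calling `.decode()` with no argument assumes UTF-8. A quick offline illustration (no network needed; the sample string is made up):

```python
raw = '作文'.encode('utf-8')  # stand-in for response.content, which is bytes
html = raw.decode()           # decode() defaults to 'utf-8'
print(type(raw).__name__, type(html).__name__)  # bytes str
```

If the site used a different encoding (say GBK), you would pass it explicitly: `url.content.decode('gbk')`.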

  2. Get the URL of each composition. Now that we can fetch the page source, the next step is to work out where the articles live. On the list page, each article is a link; press F12 to view the page source. Click the small arrow in the upper right corner of the developer tools and select an article on the page, and on the right you can see the tag that holds the article's link. Each article has its own link that we need to open.

    Now, how do you get these links? Looking at the page source, they all follow the same format.

    Here is one, and another just below it; since they all share the same structure, let's use XPath to grab them.

    The simple way is to right-click the element on the page and choose Copy → Copy XPath, which gives you the XPath path directly. Then build an XPath parse object and select the entries:

    ```python
    doc = etree.HTML(html)  # build an XPath parse object from the page source
    contents = doc.xpath('//*[@class="list-group"]/div')  # one <div> per article entry
    print(contents)
    ```

![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/00c523dc0abf4fcb86f07b84348bac63)

If we print contents, each item is an lxml Element object whose text we can't read directly, so we have to loop over the elements and pull out the href with a relative XPath:

```python
for item in contents:
    links = item.xpath('h4/a/@href')  # relative XPath: the <a> inside each entry's <h4>
    print(links)
```
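To see the relative XPath at work without hitting the site, here is a self-contained sketch on a made-up snippet that mirrors the list-group structure described above (the sample markup and link names are hypothetical):

```python
from lxml import etree

# Hypothetical sample mirroring the list page: a list-group div with h4/a links
sample = """
<div class="list-group">
  <div><h4><a href="/zuowenku/1.html">Essay one</a></h4></div>
  <div><h4><a href="/zuowenku/2.html">Essay two</a></h4></div>
</div>
"""

doc = etree.HTML(sample)
contents = doc.xpath('//*[@class="list-group"]/div')
# 'h4/a/@href' is evaluated relative to each entry <div>
links = [item.xpath('h4/a/@href')[0] for item in contents]
print(links)  # ['/zuowenku/1.html', '/zuowenku/2.html']
```

The leading `//*` in the first expression searches the whole document, while `h4/a/@href` starts from the element it is called on; that is why the second expression has no leading slash.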

Now that we have the URL of each article, the next step is to get each article's title and content, again using XPath.

Parsing an article page in the same way, we can select the content and the title:

```python
content = doc.xpath('//*[@id="content"]/p/text()')  # get the article content
title = doc.xpath('/html/body/div[4]/div/div[1]/div/div[1]/h1/text()')  # get the title
title1 = [t.replace('\r\n', '') for t in title]  # strip line breaks from the title
```
  1. So far, the title and content of each article have been obtained; the last step is to save the data.
  2. Store the data:

```python
# note: the Download folder must already exist
with open('Download/%s.txt' % title1[0], 'w', encoding='utf-8') as f:
    for items in content:
        f.write(items)
```
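Putting all the steps together, here is one way to organize the whole crawl. This is a sketch, not the tutorial's verbatim code: the function names, the `untitled` fallback, the automatic creation of the Download folder, and the assumption that the hrefs are relative to https://www.chnlib.com are my additions, and the XPath selectors depend on the site's current layout.

```python
import os

import requests
from lxml import etree

LIST_URL = 'https://www.chnlib.com/zuowenku/'


def parse_links(html):
    """Extract the href of every article entry on the list page."""
    doc = etree.HTML(html)
    links = []
    for item in doc.xpath('//*[@class="list-group"]/div'):
        links.extend(item.xpath('h4/a/@href'))
    return links


def parse_article(html):
    """Return (title, paragraphs) for a single article page."""
    doc = etree.HTML(html)
    title = doc.xpath('/html/body/div[4]/div/div[1]/div/div[1]/h1/text()')
    cleaned = [t.replace('\r\n', '') for t in title]  # strip line breaks
    paragraphs = doc.xpath('//*[@id="content"]/p/text()')
    return (cleaned[0] if cleaned else 'untitled'), paragraphs


def save_article(title, paragraphs, folder='Download'):
    """Write the paragraphs to <folder>/<title>.txt and return the path."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it doesn't exist
    path = os.path.join(folder, '%s.txt' % title)
    with open(path, 'w', encoding='utf-8') as f:
        for p in paragraphs:
            f.write(p)
    return path


def crawl():
    """Fetch the list page, then fetch and save every article."""
    list_html = requests.get(LIST_URL).content.decode()
    for link in parse_links(list_html):
        # assumption: hrefs are site-relative, so prepend the domain
        page = requests.get('https://www.chnlib.com' + link).content.decode()
        title, paragraphs = parse_article(page)
        save_article(title, paragraphs)

# To run the crawl for real, call crawl()
```

Splitting fetch, parse, and save into separate functions makes each piece testable on a saved HTML string without touching the network, which is handy when the site's layout changes and the selectors need fixing.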