GeneralNewsExtractor (GNE) is a general-purpose extractor for news web pages: it extracts the body of a news page without requiring you to specify any extraction rules.
Let’s take a look at the basic usage.
Install GNE
Install with pip:
pip install --upgrade git+https://github.com/kingname/GeneralNewsExtractor.git
You can also install it with pipenv:
pipenv install git+https://github.com/kingname/GeneralNewsExtractor.git#egg=gne
Get news page source code
GNE does not and will not provide web request functionality, so you will need to find your own way to get the rendered source code. You can use Selenium or Pyppeteer or copy directly from the browser.
Here’s how to copy the source code of a web page directly from the browser:
- Open the page in Chrome, then open the developer tools, as shown below:
- On the Elements tab, locate the <html> tag, right-click it, and select Copy > Copy outerHTML, as shown in the following figure
- Save the source code as 1.html
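If you would rather not copy the source by hand, the same step can be scripted. The following is a minimal sketch, assuming Selenium and a working browser driver are installed. The helper name save_rendered_html is made up for this illustration and is not part of GNE; it relies only on the driver's get() method and page_source attribute.

```python
def save_rendered_html(driver, url, path='1.html'):
    """Load url in a Selenium-style driver and save the rendered source.

    save_rendered_html is a hypothetical helper, not part of GNE.
    driver is any object with get(url) and page_source, such as a
    Selenium WebDriver instance.
    """
    driver.get(url)
    html = driver.page_source  # page source as seen by the browser
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)
    return html
```

With Selenium this could be called as save_rendered_html(webdriver.Chrome(), news_url), where news_url stands for the page you want to extract. Pages that load their content late may need an explicit wait before page_source is read.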
Extract body information
Write the following code:
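from gne import GeneralNewsExtractor

# Read the page source saved in the previous step
# (assumes 1.html was saved as UTF-8)
with open('1.html', encoding='utf-8') as f:
    html = f.read()

extractor = GeneralNewsExtractor()
result = extractor.extract(html)
print(result)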
The output is shown in the figure below:
What's new in this update
The latest update, v0.04, adds two features: extracting the images in the body, and returning the HTML source of the body. Image URL extraction was already demonstrated above: the images field in the result holds the images found in the body.
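To work with the images field in your own code, something like the following could be used. This is a sketch under assumptions: it treats the result as a dict whose images field is a list of URL strings, and it resolves relative URLs against a page URL that you supply yourself, since GNE only sees the HTML. The helper name absolute_image_urls is hypothetical.

```python
from urllib.parse import urljoin


def absolute_image_urls(result, page_url):
    """Return the image URLs from a GNE-style result as absolute URLs.

    Hypothetical helper, not part of GNE. Assumes result is a dict and
    result['images'] is a list of URL strings (possibly relative).
    """
    images = result.get('images') or []
    # urljoin leaves absolute URLs untouched and resolves relative ones
    return [urljoin(page_url, src) for src in images]
```

Passing the URL the page was originally fetched from as page_url keeps relative image paths usable after the HTML has been saved to a local file.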
So how do you get the HTML source of the body? Just pass with_body_html=True:
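from gne import GeneralNewsExtractor

# Read the page source saved earlier
# (assumes 1.html was saved as UTF-8)
with open('1.html', encoding='utf-8') as f:
    html = f.read()

extractor = GeneralNewsExtractor()
# with_body_html=True also returns the HTML source of the body
result = extractor.extract(html, with_body_html=True)
print(result)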
The output is shown in the figure below:
The body_html returned in the result is the HTML source code for the body.
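If you want to inspect the returned body_html in a browser, one option is to write it to a file. The snippet below is only a sketch: save_body_html and the default file name body.html are made up for this example, and it assumes the result is a dict that contains body_html when with_body_html=True was passed.

```python
def save_body_html(result, path='body.html'):
    """Write the body_html field of a GNE-style result to a file.

    Hypothetical helper, not part of GNE. Assumes result is a dict.
    """
    body_html = result.get('body_html')
    if body_html is None:
        # The field is absent, e.g. when with_body_html=True was not passed
        raise KeyError('body_html is missing from the result')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(body_html)
    return path
```

Opening the saved file in a browser is a quick way to check that the extracted fragment matches the article body.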
For more in-depth use of GNE, visit GNE’s GitHub: github.com/kingname/Ge…