GNE (GeneralNewsExtractor) is a general-purpose module for extracting the body of news web pages. It takes the HTML of a news page as input and outputs the body text, title, author, publication time, the addresses of images in the body, and the source code of the tag that contains the body. GNE performs very well on hundreds of Chinese news websites, such as Toutiao, NetEase News, Gamersky, Guancha, Ifeng.com, Tencent News, ReadHub, and Sina News, with almost 100% accuracy.

It’s very simple to use:

from gne import GeneralNewsExtractor

extractor = GeneralNewsExtractor()
html = 'Site source code'
result = extractor.extract(html)
print(result)

GNE expects HTML that has already been rendered by JavaScript, so it works well together with Selenium or Pyppeteer.

Below is a demo of GNE used with Selenium:

The corresponding code is:

import time
from gne import GeneralNewsExtractor
from selenium.webdriver import Chrome


driver = Chrome('./chromedriver')
driver.get('https://www.toutiao.com/a6766986211736158727/')
time.sleep(3)
extractor = GeneralNewsExtractor()
result = extractor.extract(driver.page_source)
print(result)

Below is a demo of GNE used with Pyppeteer:

The corresponding code is as follows:

import asyncio
from gne import GeneralNewsExtractor
from pyppeteer import launch

async def main():
    browser = await launch(executablePath='/Applications/Google Chrome.app/Contents/MacOS/Google Chrome')
    page = await browser.newPage()
    await page.goto('https://news.163.com/20/0101/17/F1QS286R000187R2.html')
    extractor = GeneralNewsExtractor()
    result = extractor.extract(await page.content())
    print(result)
    input("When you're done, come back here and press Enter.")

asyncio.run(main())

How to install GNE

GNE can be installed directly with pip:

pip install gne

If PyPI’s official index is too slow to access, you can use the NetEase mirror instead:

pip install gne -i https://mirrors.163.com/pypi/simple/

The installation process is shown below:

Features

Get the body source code

When the extract() method is given only the page source, with no additional parameters, GNE returns the following fields:

  • title: the news headline
  • publish_time: the publication time
  • author: the news author
  • content: the news body text
  • images: images in the body (relative or absolute paths)
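As a rough illustration, the returned data can be handled like any other Python dictionary. The values below are made-up placeholders rather than real GNE output, and summarize is a hypothetical helper introduced only for this sketch.

```python
# Hypothetical result, shaped like the fields listed above.
# The values are placeholders, not output produced by GNE.
result = {
    "title": "Example headline",
    "publish_time": "2020-01-01 17:00:00",
    "author": "Example author",
    "content": "First paragraph.\nSecond paragraph.",
    "images": ["/images/pic.png"],
}


def summarize(result):
    """Build a one-line summary from an extraction result (illustrative helper)."""
    # .get() avoids a KeyError if a field happens to be missing.
    title = result.get("title", "")
    author = result.get("author", "")
    n_images = len(result.get("images", []))
    return f"{title} by {author} ({n_images} image(s))"


print(summarize(result))
```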

If you also want the HTML source of the tag that contains the news body, pass with_body_html=True to the extract() method:

extractor = GeneralNewsExtractor()
extractor.extract(html, with_body_html=True)

A body_html field is then added to the returned data; its value is the HTML source code of the body.

The operating effect is shown in the figure below:

Always return the absolute path to the image

By default, if images in the news page use relative paths, the images field returned by GNE is also a list of relative paths.

If you want GNE to always return absolute paths, pass the host parameter to the extract() method. Its value is the domain name that the images are served from, for example:

extractor = GeneralNewsExtractor()
extractor.extract(html, host='https://www.kingname.info')

So if an image in the news page has the path /images/pic.png, GNE automatically converts it to https://www.kingname.info/images/pic.png.
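The effect of the host parameter can be approximated with the standard library. This is only a sketch of the idea, not GNE’s actual implementation; to_absolute is a hypothetical helper, and the cdn.example.com address is a made-up example.

```python
from urllib.parse import urljoin


def to_absolute(host, image_paths):
    """Resolve each image path against host (illustrative helper, not part of GNE)."""
    # urljoin leaves URLs that are already absolute unchanged.
    return [urljoin(host, path) for path in image_paths]


images = ["/images/pic.png", "https://cdn.example.com/a.jpg"]
print(to_absolute("https://www.kingname.info", images))
```

Paths that are already absolute pass through untouched, so only the relative ones gain the host prefix.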

Specify the XPath of the news headline

GNE comes with a predefined set of XPaths and regular expressions for extracting news headlines. On some sites these may fail to find the headline. In that case, you can pass the title_xpath parameter to the extract() method to tell GNE where the headline is:

extractor = GeneralNewsExtractor()
extractor.extract(html, title_xpath='//title/text()')

Remove noise tags in advance

Some news pages contain long comments that look “more like” the body than the body itself. To keep them from interfering with extraction, you can remove these noise nodes in advance by passing noise_node_list to the extract() method. Its value is a list of one or more XPaths:

extractor = GeneralNewsExtractor()
extractor.extract(html, noise_node_list=['//div[@class="comment-list"]', '//*[@style="display:none"]'])
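To make the idea of removing noise nodes concrete, here is a minimal sketch using only the standard library. It is not GNE’s code: it assumes a small, well-formed document, relies on the limited XPath subset that xml.etree.ElementTree supports, and remove_noise is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed document standing in for a news page.
page = (
    "<html><body>"
    "<div class='article'><p>News body.</p></div>"
    "<div class='comment-list'><p>A very long comment...</p></div>"
    "</body></html>"
)


def remove_noise(root, xpaths):
    """Delete every element matched by any of the given XPaths."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for xpath in xpaths:
        for node in root.findall(xpath):
            parent = parent_of.get(node)
            if parent is not None:
                parent.remove(node)
    return root


root = remove_noise(ET.fromstring(page), [".//div[@class='comment-list']"])
print(ET.tostring(root, encoding="unicode"))
```

After the call, the comment block is gone and only the article markup remains for any later extraction step.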

Using configuration files

Besides being passed directly to extract(), the parameters title_xpath, host, noise_node_list, and with_body_html can also be set in a configuration file.

Create a .gne file in the root directory of your project. The configuration file can be in either YAML or JSON format.

  • Configuration file in YAML format
title:
   xpath: //title/text()
host: https://www.xxx.com
noise_node_list:
   - //div[@class="comment-list"]
   - //*[@style="display:none"]
with_body_html: true
  • Configuration file in JSON format
{
   "title": {
       "xpath": "//title/text()"
   },
   "host": "https://www.xxx.com",
   "noise_node_list": ["//div[@class=\"comment-list\"]",
                       "//*[@style=\"display:none\"]"],
   "with_body_html": true
}

The two formats are exactly equivalent.

The fields in the configuration file correspond to the parameters of the extract() method. Not all fields are required; include only the ones you need.

If a parameter is set both in the extract() call and in the .gne configuration file with different values, the value passed to extract() takes priority.
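The priority rule can be pictured as a simple merge in which call-time values override file values. The sketch below only illustrates that rule; resolve_options is a hypothetical helper, and how GNE actually loads and merges its configuration is not shown in this article.

```python
import json

# Contents of a hypothetical .gne file in JSON format.
config_text = '{"host": "https://www.xxx.com", "with_body_html": true}'


def resolve_options(file_config, **call_kwargs):
    """Merge config-file values with call-time arguments; call-time values win."""
    options = dict(file_config)
    # Only override when the caller actually supplied a value.
    options.update({k: v for k, v in call_kwargs.items() if v is not None})
    return options


options = resolve_options(json.loads(config_text), host="https://www.kingname.info")
print(options)
```

Here host comes from the call, while with_body_html, which the call did not set, falls back to the value in the file.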

FAQ

Is GeneralNewsExtractor (GNE) a crawler?

GNE is not a crawler. As its project name, General News Extractor, suggests, it is a general-purpose extractor for news pages. The input is HTML, and the output is a dictionary containing the headline, body, author, and publication time of the news. You need to obtain the HTML of the target page yourself.

GNE does not and will not provide the ability to request web pages.

Does GNE support pagination?

GNE does not support pagination. Because GNE does not request web pages itself, you need to fetch the HTML of each page yourself and pass each one to GNE separately.
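One way to handle a multi-page article is to run the extractor on each page and join the results yourself. The following is a sketch under assumptions: extract_all_pages is a hypothetical helper, and fake_extract is a stand-in so the example runs without GNE or a network connection.

```python
def extract_all_pages(pages_html, extract):
    """Run `extract` on each page's HTML and join the body text in page order."""
    results = [extract(html) for html in pages_html]
    if not results:
        return {}
    merged = dict(results[0])  # title, author, etc. come from the first page
    merged["content"] = "\n".join(r.get("content", "") for r in results)
    merged["images"] = [img for r in results for img in r.get("images", [])]
    return merged


# Stand-in extractor so the sketch runs without GNE installed.
def fake_extract(html):
    return {"title": "Example", "content": html.upper(), "images": []}


print(extract_all_pages(["page one", "page two"], fake_extract))
```

In real use you would pass GeneralNewsExtractor().extract as the extract argument, along with the rendered HTML of each page.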

What versions of Python does GNE support?

Python 3.6.0 or later.

I passed HTML fetched with requests/Scrapy to GNE. Why can’t the body be extracted?

GNE extracts the body from HTML, so the HTML you pass in must already be rendered by JavaScript. requests and Scrapy only fetch the source code as it is before JavaScript rendering, so GNE cannot extract the body from it properly.

In addition, some sites, such as Toutiao, embed the news body directly in the page source as JSON. When the page is opened in a browser, JavaScript parses that JSON and turns it into HTML. In this case you will not see any Ajax requests in Chrome.

So I recommend using a tool such as Puppeteer, Pyppeteer, or Selenium to obtain the rendered HTML and then pass it to GNE.

Does GNE support non-news sites (e.g. blogs, forums…)?

No, they are not supported.

About GNE

GNE official documentation: generalnewsextractor.readthedocs.io/

GNE project source code at: github.com/kingname/Ge… .

About the author