GNE is a general-purpose news body extractor. Since it was open-sourced, many people have used it as a key component of general news crawlers.
GNE GitHub address: github.com/GeneralNews… . Its algorithm comes from the paper "Web Page Text Extraction Method Based on Text and Symbol Density," which locates the body text using only the information in the HTML. As a result, it has some inherent defects:
- If the body text is only a few sentences long but the comments are lengthy, extraction will fail
- If the body text contains too many HTML tags, the extractor may locate the text in the wrong place
- Copyright information is often extracted by mistake
But a human looking at the web page would never make these mistakes: the body text is obviously not in the same place as the comments, and the copyright information usually sits at the bottom of the page. These visual signals are determined by CSS and are invisible from the HTML alone.
The HTML that GNE takes as input was always meant to be the HTML output by a simulated browser after rendering, not the raw page source. In that case, why not record the coordinates of every node while the simulated browser is running? All it takes is executing a small piece of JavaScript that records, for each node, whether it is visible, and for each visible node its width, height, and the coordinates of its upper-left and lower-right corners. GNE can then consult this information while parsing: invisible nodes are removed outright, and nodes whose size or position is obviously wrong are discarded as well. This greatly improves the accuracy of body-text recognition.
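The browser already exposes exactly this geometry through the standard DOM API. To get a feel for it, you can run a few lines like the following in the Console on any page (the selector here is just an illustration, not part of GNE):

// Illustrative only: the geometry available for a single node
const el = document.querySelector('p')  // any element of interest
const rect = el.getBoundingClientRect()
// rect.width / rect.height: size of the node
// (rect.left, rect.top): upper-left corner; (rect.right, rect.bottom): lower-right corner
console.log(el.offsetParent !== null, JSON.stringify(rect))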
So how well does extraction based on visual signals actually work? Let's take a news story as an example: an incident in Laibin City, Guangxi Province, with photos from the scene.
First, copy the JS-rendered source code directly from the browser's developer tools:
When we run GNE directly on this HTML, the result is as shown below:
As you can see, what gets extracted is the copyright information.
Now, using the modified HTML instead, the body text is extracted successfully, as shown below:
So what is special about this modified HTML? Let's see what it looks like:
Every node below the body tag has an attribute called is_visiable, whose value is the string true or false. If it is true, the node also has another attribute called coordinate, whose value is a JSON string containing the node's size, coordinates, and other information.
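For example, a visible paragraph node might look like this after modification (the attribute values here are illustrative, not taken from the real page; the coordinate keys are what JSON.stringify produces for a DOMRect):

<p is_visiable="true" coordinate='{"x":360,"y":882,"width":640,"height":44,"top":882,"right":1000,"bottom":926,"left":360}'>Body text...</p>
<div is_visiable="false">An invisible node gets no coordinate attribute</div>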
So how is this special HTML generated? If you just want to run a quick test, execute the following JavaScript in the Console tab of Chrome's developer tools:
function insert_visiability_info() {
    function get_body() {
        return document.getElementsByTagName('body')[0]
    }

    // Mark a node as visible or not, and record its geometry if visible
    function insert_info(element) {
        // offsetParent is null for elements that are not rendered (e.g. display: none)
        const is_visiable = element.offsetParent !== null
        element.setAttribute('is_visiable', is_visiable)
        if (is_visiable) {
            const rect = element.getBoundingClientRect()
            element.setAttribute('coordinate', JSON.stringify(rect))
        }
    }

    // Walk the DOM tree and annotate every node
    function iter_node(node) {
        insert_info(node)
        for (const element of node.children) {
            iter_node(element)
        }
    }

    // Collect page-level size and position information
    function sizes() {
        const contentWidth = [...document.body.children].reduce(
            (a, el) => Math.max(a, el.getBoundingClientRect().right), 0)
            - document.body.getBoundingClientRect().x
        return {
            windowWidth: document.documentElement.clientWidth,
            windowHeight: document.documentElement.clientHeight,
            pageWidth: Math.min(document.body.scrollWidth, contentWidth),
            pageHeight: document.body.scrollHeight,
            screenWidth: window.screen.width,
            screenHeight: window.screen.height,
            pageX: document.body.getBoundingClientRect().x,
            pageY: document.body.getBoundingClientRect().y,
            screenX: -window.screenX,
            screenY: -window.screenY - (window.outerHeight - window.innerHeight),
        }
    }

    // Store the page-level info in a <meta> tag in the head
    function insert_page_info() {
        const page_info = sizes()
        const node = document.createElement('meta')
        node.setAttribute('name', 'page_visiability_info')
        node.setAttribute('page_info', JSON.stringify(page_info))
        document.getElementsByTagName('head')[0].appendChild(node)
    }

    insert_page_info()
    iter_node(get_body())
}

insert_visiability_info()
As shown below:
Once this is done, reopen the Elements tab and you can see that the desired attributes have been added to every node.
If you want to achieve the same thing in a crawler with Puppeteer or Selenium and run the JavaScript automatically in batch, I have provided a demo you can reference on GitHub: GeneralNewsExtractor/GneRender: Render web page to add necessary info on every dom element.
You only need to run the following commands:
yarn install
node render.js
This generates test.html in the current folder, which is the modified special HTML.
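If you only want the general idea, the rendering step boils down to something like the following Puppeteer sketch. This is a minimal sketch under my own assumptions (the URL, file names, and waitUntil option are made up for illustration), not the exact code in GneRender:

// A minimal sketch, assuming Puppeteer; not the exact code in GneRender
const fs = require('fs')
const puppeteer = require('puppeteer')

async function render(url) {
    const browser = await puppeteer.launch()
    const page = await browser.newPage()
    await page.goto(url, { waitUntil: 'networkidle2' })
    // Inject and run the annotation script shown earlier
    // (assumed to be saved as insert_visiability_info.js)
    const script = fs.readFileSync('insert_visiability_info.js', 'utf8')
    await page.addScriptTag({ content: script })
    // Dump the modified DOM as HTML
    fs.writeFileSync('test.html', await page.content())
    await browser.close()
}

render('https://example.com/news.html')  // hypothetical URL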
The latest version of GNE has been published to PyPI, so you can now install it directly with pip:
pip install gne
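After installation, calling GNE looks roughly like this. A minimal sketch: the extract() call follows GNE's documented interface, while the file name is my assumption (test.html being the modified HTML generated above):

# A minimal usage sketch; test.html is the modified special HTML
from gne import GeneralNewsExtractor

with open('test.html', encoding='utf-8') as f:
    html = f.read()

extractor = GeneralNewsExtractor()
result = extractor.extract(html)
print(result)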