Simple crawler: Python, Golang, and GraphQuery

This article uses Python, Golang, and GraphQuery in turn to parse the material detail page of a website. The page is characterized by a clear data structure, but its DOM structure is not standardized enough to locate the page elements with a single selector, which makes parsing take a few twists and turns. By walking through the parsing of this page, we can get a simple view of the similarities and differences between the parsing approaches of crawlers in these languages.

One, foreword

In this foreword, to prevent unnecessary confusion in later sections, we will first cover some basic programming concepts.

1. Semantic DOM structure

The semantic DOM structure we are talking about here includes not only semantic HTML tags but also semantic selectors. In front-end development, note that every piece of dynamic text should be wrapped in its own HTML tag, preferably one with a semantic class or id attribute. This helps both front-end and back-end development during later functional iterations. Take the following HTML code:

<div class="main-right FR "> <p> Ref: 32490230</p> <p class=" main-rightstage "> </p> <p class=" main-rightstage ">Copy the code

The values 32504070, RGB, 16.659 MB, and 72dpi here are dynamic attributes that change from material to material. Under standard development practice, these dynamic attributes should be wrapped in inline tags such as <span>. From the HTML structure above we can infer that this page is rendered directly by the back end with a foreach loop, which does not fit the idea of separating the front end from the back end. If one day it is decided to deliver these attributes via JSONP or Ajax and let the front end render them, the workload will certainly rise to a new level. A semantic DOM structure tends to look like this:

<p class=" main-rightstage property-mode"> <span>RGB</span> </p>Copy the code

The property-mode class could also be placed directly on the span. Either way, whether these attributes are rendered by the back end or dynamically by the front end, the burden of product iteration is reduced.
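For instance, the variant with property-mode placed directly on the span, as described above, would look like this:

<p class="main-rightStage">
    <span class="property-mode">RGB</span>
</p>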

2. Stable parsing code

With semantic DOM structure covered, let's talk about stable parsing code, using the following DOM structure:

<div class="main-right FR "> <p> Ref: 32490230</p> <p class=" main-rightstage "> </p> <p class=" main-rightstage ">Copy the code

If we want to extract the mode information, we can of course take the following steps (a code sketch follows the list):

  1. Select the div whose class attribute contains main-right
  2. Select the second p element inside the div and retrieve the text it contains
  3. Delete the text "Mode: "; the mode is RGB
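Here is a minimal pyquery sketch of these three steps; the html variable and the English labels are taken from the snippet above, and the selector details are illustrative:

from pyquery import PyQuery as pq

html = """
<div class="main-right fr">
    <p>Number: 32490230</p>
    <p class="main-rightStage">Mode: RGB</p>
    <p class="main-rightStage">Volume: 16.659 MB</p>
</div>
"""
document = pq(html)
# Steps 1 and 2: the root here is already the div whose class contains
# main-right, so take its second p element
second_p = document("p").eq(1)
# Step 3: delete the "Mode:" label, leaving the mode itself
mode = second_p.text().replace("Mode:", "").strip()
print(mode)  # RGB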

Although we successfully obtain the desired result, such a parsing method is considered unstable. Unstable here means that when the structure of the element's ancestors, siblings, or other related elements changes to a certain degree, the parsing produces errors or fails. For example, suppose one day a size attribute is added before the node that holds the mode:

<div class="main-right fr"> <p> number: 32490230</p> </p> <p class=" main-rightstage "> Mode: RGB</p> <p class=" main-rightstage "> Volume: </p> <p class=" main-rightstage ">Copy the code

Then our previous parsing breaks (what? You don't think that's possible? Please compare Page1 and Page2). To deal with this kind of document structure, there are usually a few options, sketched in code after this list:

  1. Traverse the candidate nodes and pick the one whose text starts with "Mode:", then take the content after the label. The disadvantage is too much logic, which is hard to maintain and reduces code readability.
  2. Use a regular expression such as Mode: ([A-Z]+) for matching. However, improper use may cause efficiency problems.
  3. Use the contains method of the CSS selector, such as .main-rightStage:contains(Mode), which selects a node whose class attribute contains main-rightStage and whose text contains "Mode". The disadvantage is that different languages and libraries support this syntax to varying degrees, so it lacks compatibility.

Depending on which approach you use, parsing stability, complexity, efficiency, and compatibility will vary; developers need to balance these factors to write the best parsing code.
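As a rough sketch, the first two options might look like this in Python with pyquery and re; the labels and markup follow the example above, so treat the patterns as illustrative rather than definitive:

import re
from pyquery import PyQuery as pq

html = """
<div class="main-right fr">
    <p>Number: 32490230</p>
    <p class="main-rightStage">Size: 4724×6299 pixels</p>
    <p class="main-rightStage">Mode: RGB</p>
</div>
"""
document = pq(html)

# Option 1: traverse the p nodes and test the text prefix
mode = ""
for node in document("p"):
    text = pq(node).text()
    if text.startswith("Mode:"):
        mode = text[len("Mode:"):].strip()
print(mode)  # RGB

# Option 2: a regular expression over the combined text
match = re.search(r"Mode: ([A-Z]+)", document.text())
print(match.group(1) if match else "")  # RGB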

Two, page analysis

Before extracting page data, the first thing to do is to clarify which data we need and which data the page provides, and then design the data structure accordingly. First, open the page to be parsed. Since the view, favorite, and download counts at the top are loaded dynamically, we do not need them for this demonstration. Meanwhile, as the comparison of Page1 and Page2 above shows, the size, mode, and other data on the right side of the page do not necessarily exist, so they are grouped together under metadata. With that, the data we need to obtain is settled.

From this we can quickly design our data structure:

{
    title
    pictype
    number
    type
    metadata {
        size
        volume
        mode
        resolution
    }
    author
    images []
    tags []
}

Since size, volume, mode, and resolution may not exist, they are grouped under metadata. images is an array of image addresses, and tags is an array of tags. Once the data structure to be extracted is determined, parsing can begin.

Use Python for page parsing

Python has a vast number of libraries, and many excellent ones can help us. When parsing pages in Python, we usually use the following libraries:

  1. Regular expressions: re
  2. CSS selectors: pyquery, beautifulsoup4
  3. XPath: lxml
  4. JSON Path: jsonpath_rw

All of these libraries are supported under Python 3 and can be installed with pip install. Among options 2 and 3 we chose pyquery, because CSS selector syntax is more concise than XPath syntax and pyquery's method calls are more convenient than beautifulsoup4's. Below we use the title and type attributes as examples; the same approach applies to the other nodes. First we use the requests library to download the source of the page:

import requests
from pyquery import PyQuery as pq
response = requests.get("http://www.58pic.com/newpic/32504070.html")
document = pq(response.content.decode('gb2312'))

The rest of the Python parsing proceeds on this basis.

1. Obtain the title node

Open the page to be parsed, right-click on the title, and click View element. You can see its DOM structure as follows:

At this point we notice that the title text we want to extract, "Hero poster Jin Yong wuxia ink Chinese style black and white", is not wrapped in an HTML tag of its own, which does not conform to the semantic DOM structure discussed above. It is also impossible to select this bare text node directly with a CSS selector (XPath can select it directly; we omit that here). For such a node we can take two approaches: first, select the parent element, take its HTML content, and use a regular expression to match the text between the surrounding tags; second, select the parent element and remove its child elements, so that only the bare title text remains. Here we take the second approach:

title_node = document.find(".detail-title")
title_node.find("div").remove()
title_node.find("p").remove()
print(title_node.text())

The output is, as we expected, "Hero poster Jin Yong wuxia ink Chinese style black and white".
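For comparison, here is a rough sketch of the first approach, run on a freshly parsed document (the remove() calls above mutate the DOM); it assumes the bare title text comes before the child tags inside .detail-title, so treat it as illustrative:

import re

# Inner HTML of the title's parent node
raw_html = document.find(".detail-title").html()
# Match the leading bare text, i.e. everything before the first child tag
match = re.match(r"\s*([^<]+)", raw_html)
print(match.group(1).strip() if match else "")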

2. Obtain the size node

Right-click on the size to view the element and you can see the DOM structure shown below:

We find that these nodes have no semantic selectors, and these attributes do not always exist (compare Page1 and Page2 above). In the discussion of stable parsing code we covered several ideas for this kind of document structure; here we use the regular expression approach:

import re

context = document.find(".mainRight-file").text()
size_matches = re.compile(r"Size: (.*?) pixels").findall(context)
size = ""
if len(size_matches) > 0:
    size = size_matches[0]
print(size)

Since similar methods can be used to obtain the size, volume, mode, and resolution attributes, we can factor out a regular-extraction helper function:

def regex_get(text, expr):
    matches = re.compile(expr).findall(text)
    if len(matches) == 0:
        return ""
    return matches[0]

Therefore, when getting the size node, our code can be simplified to:

size = regex_get(context, r"Size: (.*?) pixels")
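The remaining optional fields follow the same pattern; these label regexes mirror the ones used in the complete code below:

volume = regex_get(context, r"Volume: (.*? MB)")
mode = regex_get(context, r"Mode: ([A-Z]+)")
resolution = regex_get(context, r"Resolution: (\d+dpi)")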

3. Complete Python code

At this point, we’ve solved most of the problems we might encounter parsing the page. The entire Python code looks like this:

import requests
import re
from pyquery import PyQuery as pq

def regex_get(text, expr):
    matches = re.compile(expr).findall(text)
    if len(matches) == 0:
        return ""
    return matches[0]

conseq = {}

## Download the document
response = requests.get("http://www.58pic.com/newpic/32504070.html")
document = pq(response.content.decode('gb2312'))

## Get the title
title_node = document.find(".detail-title")
title_node.find("div").remove()
title_node.find("p").remove()
conseq["title"] = title_node.text()

## Get the material type
conseq["pictype"] = document.find(".pic-type").text()

## Get the file format
conseq["filetype"] = regex_get(document.find(".mainRight-file").text(), r"File format: ([A-Z]+)")

## Get the metadata
context = document.find(".main-right p").text()
conseq["metadata"] = {
    "size": regex_get(context, r"Size: (.*?) pixels"),
    "volume": regex_get(context, r"Volume: (.*? MB)"),
    "mode": regex_get(context, r"Mode: ([A-Z]+)"),
    "resolution": regex_get(context, r"Resolution: (\d+dpi)"),
}

## Get the author
conseq["author"] = document.find(".user-name").text()

## Get the images
conseq["images"] = []
for node_image in document.find("#show-area-height img"):
    conseq["images"].append(pq(node_image).attr("src"))

## Get the tags
conseq["tags"] = []
for node_tag in document.find(".mainRight-tagBox .fl"):
    conseq["tags"].append(pq(node_tag).text())

print(conseq)

Use Golang for page parsing

To parse HTML and XML documents in Golang, the following libraries are commonly used:

  1. Regular expressions: regexp
  2. CSS selectors: github.com/PuerkitoBio/goquery
  3. XPath: gopkg.in/xmlpath.v2
  4. JSON Path: github.com/tidwall/gjson

You can get all of these libraries with go get -u. Since we have already sorted out the parsing logic in Python above, in Golang we only need to replicate it. Unlike in Python, it is better to first define a struct for our data structure, like the following:

type Result struct {
    Title    string
    Pictype  string
    Number   string
    Type     string
    Metadata struct {
        Size       string
        Volume     string
        Mode       string
        Resolution string
    }
    Author string
    Images []string
    Tags   []string
}

At the same time, since the page to be parsed uses the non-mainstream GBK encoding, after downloading the document we need to manually convert it from GBK to UTF-8. Although this step is outside the scope of parsing proper, it is one of the steps that must be done. We use the github.com/axgle/mahonia library for the encoding conversion and wrap it in a decoderConvert function:

func decoderConvert(name string, body string) string {
    return mahonia.NewDecoder("gbk").ConvertString(body)
}

Therefore, the final Golang code should look something like this:

package main

import (
    "encoding/json"
    "log"
    "regexp"
    "strings"

    "github.com/PuerkitoBio/goquery"
    "github.com/axgle/mahonia"
    "github.com/parnurzeal/gorequest"
)

type Result struct {
    Title    string
    Pictype  string
    Number   string
    Type     string
    Metadata struct {
        Size       string
        Volume     string
        Mode       string
        Resolution string
    }
    Author string
    Images []string
    Tags   []string
}

// RegexGet returns the first capture group of expr in text, or an empty
// string when there is no match (mirroring regex_get in the Python version).
func RegexGet(text string, expr string) string {
    regex, _ := regexp.Compile(expr)
    matches := regex.FindStringSubmatch(text)
    if len(matches) < 2 {
        return ""
    }
    return matches[1]
}

func decoderConvert(name string, body string) string {
    return mahonia.NewDecoder("gbk").ConvertString(body)
}

func main() {
    // Download the document
    request := gorequest.New()
    _, body, _ := request.Get("http://www.58pic.com/newpic/32504070.html").End()
    document, err := goquery.NewDocumentFromReader(strings.NewReader(decoderConvert("gbk", body)))
    if err != nil {
        panic(err)
    }
    conseq := &Result{}
    // Get the title
    titleNode := document.Find(".detail-title")
    titleNode.Find("div").Remove()
    titleNode.Find("p").Remove()
    conseq.Title = titleNode.Text()
    // Get the material type
    conseq.Pictype = document.Find(".pic-type").Text()
    // Get the file format
    conseq.Type = RegexGet(document.Find(".mainRight-file").Text(), `File format: ([A-Z]+)`)
    // Get the metadata
    context := document.Find(".main-right p").Text()
    conseq.Metadata.Size = RegexGet(context, `Size: (.*?) pixels`)
    conseq.Metadata.Volume = RegexGet(context, `Volume: (.*? MB)`)
    conseq.Metadata.Mode = RegexGet(context, `Mode: ([A-Z]+)`)
    conseq.Metadata.Resolution = RegexGet(context, `Resolution: (\d+dpi)`)
    // Get the author
    conseq.Author = document.Find(".user-name").Text()
    // Get the images
    document.Find("#show-area-height img").Each(func(i int, element *goquery.Selection) {
        if attribute, exists := element.Attr("src"); exists && attribute != "" {
            conseq.Images = append(conseq.Images, attribute)
        }
    })
    // Get the tags
    document.Find(".mainRight-tagBox .fl").Each(func(i int, element *goquery.Selection) {
        conseq.Tags = append(conseq.Tags, element.Text())
    })
    bytes, _ := json.Marshal(conseq)
    log.Println(string(bytes))
}

The parsing logic is exactly the same, and the amount of code and its complexity are comparable to the Python version. Now let's see what the new GraphQuery can do.

Use GraphQuery for parsing

Given that the data structure we want to obtain is as follows:

{
    title
    pictype
    number
    type
    metadata {
        size
        volume
        mode
        resolution
    }
    author
    images []
    tags []
}

The code for GraphQuery looks like this:

{
    title `xpath("/html/body/div[4]/div[1]/div/div/div[1]/text()")`
    pictype `css(".pic-type")`
    number `css(".detailBtn-down");attr("data-id")`
    type `regex("File format: ([A-Z]+)")`
    metadata `css(".main-right p")` {
        size `regex("Size: (.*?) pixels")`
        volume `regex("Volume: (.*? MB)")`
        mode `regex("Mode: ([A-Z]+)")`
        resolution `regex("Resolution: (\d+dpi)")`
    }
    author `css(".user-name")`
    images `css("#show-area-height img")` [
        src `attr("src")`
    ]
    tags `css(".mainRight-tagBox .fl")` [
        tag `text()`
    ]
}

As you can see from the comparison, it simply adds some functions wrapped in backticks to the data structure we designed. Amazingly, it completely reproduces the parsing logic we implemented above in Python and Golang, and its syntax makes the structure of the returned data easier to read. The result of executing this GraphQuery is as follows:

{ "data": { "author": "Ice bear", "images": [ "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg! /fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a0", "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg! /fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a1024", "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg! /fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a2048", "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg! /fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a3072" ], "metadata": { "mode": "RGB", "resolution": "200dpi", "size": "4724×6299", "volume": "196.886 MB"}, "number": "32504070", "picType ": "Original", "tags" : [" warrior ", "poster", "black and white", "jin yong", "ink painting", "martial arts", "Chinese wind"], "title" : "warrior posters jin yong's martial arts ink Chinese wind of black and white", "type" : "PSD"}, "error" : "", "timecost": 10997800 }Copy the code

GraphQuery is a text query language that does not depend on any back-end language and can be called from any back-end language; a single GraphQuery statement yields the same parsing result in every language. It has built-in XPath selectors, CSS selectors, JSONPath selectors, and regular expressions, as well as a large number of text processing functions. Its structure is clear and easy to read, and it guarantees consistency among the data structure, the parsing code, and the structure of the returned result.

Project address: github.com/storyicon/graphquery

GraphQuery's syntax is simple and easy to learn even for newcomers, and intuitiveness is one of its design philosophies. So how do we call it?

1. Call GraphQuery in Golang

In Golang, you only need to run go get -u github.com/storyicon/graphquery to get GraphQuery, and then call it in your code:

package main

import (
    "log"

    "github.com/axgle/mahonia"
    "github.com/parnurzeal/gorequest"
    "github.com/storyicon/graphquery"
)

func decoderConvert(name string, body string) string {
    return mahonia.NewDecoder("gbk").ConvertString(body)
}

func main() {
    request := gorequest.New()
    _, body, _ := request.Get("http://www.58pic.com/newpic/32504070.html").End()
    body = decoderConvert("gbk", body)
    response := graphquery.ParseFromString(body, "{ title `xpath(\"/html/body/div[4]/div[1]/div/div/div[1]/text()\")` pictype `css(\".pic-type\")` number `css(\".detailBtn-down\");attr(\"data-id\")` type `regex(\"File format: ([A-Z]+)\")` metadata `css(\".main-right p\")` { size `regex(\"Size: (.*?) pixels\")` volume `regex(\"Volume: (.*? MB)\")` mode `regex(\"Mode: ([A-Z]+)\")` resolution `regex(\"Resolution: (\\d+dpi)\")` } author `css(\".user-name\")` images `css(\"#show-area-height img\")` [ src `attr(\"src\")` ] tags `css(\".mainRight-tagBox .fl\")` [ tag `text()` ] }")
    log.Println(response)
}

Our GraphQuery expression is passed in single-line form as the second parameter of the graphquery.ParseFromString function, and the result is exactly as expected.

2. Call GraphQuery in Python

In other back-end languages such as Python, calling GraphQuery requires starting its service first. The service is already compiled for Windows, Mac, and Linux and can be downloaded from graphquery-http. After unpacking it and starting the service, we can happily use GraphQuery to parse any document in any back-end language. Example code for a Python call is as follows:

import requests

def GraphQuery(document, expr):
    response = requests.post("http://127.0.0.1:8559", data={
        "document": document,
        "expression": expr,
    })
    return response.text

response = requests.get("http://www.58pic.com/newpic/32504070.html")
conseq = GraphQuery(response.text, r"""
    {
        title `xpath("/html/body/div[4]/div[1]/div/div/div[1]/text()")`
        pictype `css(".pic-type")`
        number `css(".detailBtn-down");attr("data-id")`
        type `regex("File format: ([A-Z]+)")`
        metadata `css(".main-right p")` {
            size `regex("Size: (.*?) pixels")`
            volume `regex("Volume: (.*? MB)")`
            mode `regex("Mode: ([A-Z]+)")`
            resolution `regex("Resolution: (\d+dpi)")`
        }
        author `css(".user-name")`
        images `css("#show-area-height img")` [
            src `attr("src")`
        ]
        tags `css(".mainRight-tagBox .fl")` [
            tag `text()`
        ]
    }
""")
print(conseq)

The output is:

{ "data": { "author": "Ice bear", "images": [ "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg! /fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a0", "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg! /fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a1024", "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg! /fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a2048", "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg! /fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a3072" ], "metadata": { "mode": "RGB", "resolution": "200dpi", "size": "4724×6299", "volume": "196.886 MB"}, "number": "32504070", "picType ": "Original", "tags" : [" warrior ", "poster", "black and white", "jin yong", "ink painting", "martial arts", "Chinese wind"], "title" : "warrior posters jin yong's martial arts ink Chinese wind of black and white", "type" : "PSD"}, "error" : "", "timecost": 10997800 }Copy the code

Three, afterword

Complex parsing logic is not only a code-readability problem; it also brings plenty of trouble in code maintenance and porting, and different languages and libraries produce differences in parsing results as well. GraphQuery is a new open source project whose purpose is to free developers from this repetitive and tedious parsing logic and to help them write highly readable, highly portable, and highly maintainable code. You are welcome to put it into practice, keep watching, and contribute to GraphQuery and the open source community!