Soup, the Go counterpart of BeautifulSoup, and GoQuery, the Go counterpart of PyQuery, have already been introduced in this series, but my favorite page-parsing combination for writing Python crawlers is lxml + XPath. Why is that? Let's talk about the advantages of lxml and XPath in turn.

lxml

lxml is an HTML/XML parser implemented as Python bindings for the C libraries libxml2 and libxslt. Besides being highly efficient, its other notable feature is strong fault tolerance with malformed documents.

XPath

XPath, which stands for XML Path Language, is a language for finding information in XML documents. It was originally designed for querying XML, but it is just as suitable for HTML. By writing a path expression, optionally with the built-in standard functions, you can address exactly the content you want in one step; there is no need to chain Find calls down to a node and then call Text on it, as with soup or goquery. One expression yields the result, and that directness is XPath's defining advantage. Since lxml happens to support XPath, lxml + XPath has always been my first choice for writing crawlers.
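
For instance, these are the three expressions used later in this post (the syntax is library-independent):

//ol[@class="grid_view"]/li//div[@class="hd"]    every "hd" div inside the list items
./a/@href                                        the href attribute of the <a> under the current node
.//span[@class="title"]/text()                   the text of the title spans under the current node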

XPath has a steeper learning curve than BeautifulSoup (Soup) or PyQuery (GoQuery), but it is well worth learning, and you will grow to love it. I learned XPath while writing crawlers in Python, and now I can simply pick any library that supports XPath and use it directly.

On the other hand, if you really prefer BeautifulSoup, be sure to use BeautifulSoup + lxml, because BeautifulSoup's default HTML parser is html.parser from the Python standard library: it is very fault-tolerant, but much less efficient.

I learned XPath through W3School; you can find the link in the extended reading section.

XPath libraries in Golang

There are quite a few XPath libraries written in Golang. Since I have no real-world experience with any of them, I will try out the ones I could find and then draw a conclusion.

First, here is part of the HTML of the Douban Top250 page:

<ol class="grid_view">
  <li>
    <div class="item">
      <div class="info">
        <div class="hd">
          <a href="https://movie.douban.com/subject/1292052/" class="">
            <span class="title">The Shawshank Redemption</span>
            <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
            <span class="other">&nbsp;/&nbsp;Black fly (Hong Kong) / Exciting 1995 (Taiwan)</span>
          </a>
          <span class="playable">[can play]</span>
        </div>
      </div>
    </div>
  </li>
  ...
</ol>

The requirement is the same as before: get each item's ID and title (for the snippet above, 1292052 from the href, and The Shawshank Redemption).

github.com/lestrrat-go/libxml2

lestrrat-go/libxml2 is a Go binding for libxml2.

Install it first:

❯ go get github.com/lestrrat-go/libxml2

Then modify the code:

import (
        "log"
        "time"
        "strings"
        "strconv"
        "net/http"

        "github.com/lestrrat-go/libxml2"
        "github.com/lestrrat-go/libxml2/types"
        "github.com/lestrrat-go/libxml2/xpath"
)

func fetch(url string) types.Document {
        log.Println("Fetch Url", url)
        client := &http.Client{}
        req, _ := http.NewRequest("GET", url, nil)
        // Set a crawler User-Agent so the request is not rejected.
        req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
        resp, err := client.Do(req)
        if err != nil {
                log.Fatal("Http get err:", err)
        }
        if resp.StatusCode != 200 {
                log.Fatal("Http status code:", resp.StatusCode)
        }
        defer resp.Body.Close()
        // Parse the response body into a libxml2 document.
        doc, err := libxml2.ParseHTMLReader(resp.Body)
        if err != nil {
                log.Fatal(err)
        }
        return doc
}

The fetch function is essentially unchanged; the document is now obtained with libxml2.ParseHTMLReader(resp.Body). The changes to parseUrls are bigger:

func parseUrls(url string, ch chan bool) {
        doc := fetch(url)
        // libxml2 documents wrap C memory, so they must be freed explicitly.
        defer doc.Free()
        nodes := xpath.NodeList(doc.Find(`//ol[@class="grid_view"]/li//div[@class="hd"]`))
        for _, node := range nodes {
                urls, _ := node.Find("./a/@href")
                titles, _ := node.Find(`.//span[@class="title"]/text()`)
                log.Println(strings.Split(urls.NodeList()[0].TextContent(), "/")[4],
                        titles.NodeList()[0].TextContent())
        }
        time.Sleep(2 * time.Second)
        ch <- true
}

I personally find libxml2's interface a bit painful: every match requires NodeList()[index].TextContent() to extract a value, and each query, such as node.Find("/foo/bar"), has to be wrapped in xpath.NodeList to get the result of the corresponding XPath expression. You also have to remember to Free the document when you are done.
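
If the repetition bothers you, the pattern is easy to wrap in a small helper. This is only a sketch built from the calls shown above; the name first is hypothetical, not part of the library:

// first evaluates an XPath expression relative to node and returns the
// text content of the first match, or "" when nothing matches.
// (Hypothetical convenience helper, not part of lestrrat-go/libxml2.)
func first(node types.Node, expr string) string {
        nodes := xpath.NodeList(node.Find(expr))
        if len(nodes) == 0 {
                return ""
        }
        return nodes[0].TextContent()
}

With it, the loop body above shrinks to a single log.Println(strings.Split(first(node, "./a/@href"), "/")[4], first(node, `.//span[@class="title"]/text()`)).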

github.com/antchfx/htmlquery

htmlquery, as its name implies, is a package for running XPath queries against HTML documents. Its core is antchfx/xpath; the project is updated frequently and the documentation is complete.

Install it first:

❯ go get github.com/antchfx/htmlquery

Then modify as required:

import (
    "log"
    "time"
    "strings"
    "strconv"
    "net/http"

    "golang.org/x/net/html"
    "github.com/antchfx/htmlquery"
)

func fetch(url string) *html.Node {
    log.Println("Fetch Url", url)
    client := &http.Client{}
    req, _ := http.NewRequest("GET", url, nil)
    // Set a crawler User-Agent so the request is not rejected.
    req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
    resp, err := client.Do(req)
    if err != nil {
        log.Fatal("Http get err:", err)
    }
    if resp.StatusCode != 200 {
        log.Fatal("Http status code:", resp.StatusCode)
    }
    defer resp.Body.Close()
    // Parse the response body into an *html.Node tree.
    doc, err := htmlquery.Parse(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    return doc
}

The main changes in fetch are htmlquery.Parse(resp.Body) and the return type *html.Node. Now look at parseUrls:

func parseUrls(url string, ch chan bool) {
    doc := fetch(url)
    nodes := htmlquery.Find(doc, `//ol[@class="grid_view"]/li//div[@class="hd"]`)
    for _, node := range nodes {
        url := htmlquery.FindOne(node, "./a/@href")
        title := htmlquery.FindOne(node, `.//span[@class="title"]/text()`)
        log.Println(strings.Split(htmlquery.InnerText(url), "/")[4],
            htmlquery.InnerText(title))
    }
    time.Sleep(2 * time.Second)
    ch <- true
}
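
The excerpts above omit the driver. Following the pattern used in earlier posts of this series, a main that crawls all ten Top250 pages concurrently would look roughly like this (a sketch, assuming the standard Douban pagination; it is also why strconv appears in the import lists):

func main() {
    start := time.Now()
    ch := make(chan bool)
    // Douban Top250 is paginated 25 items per page: start=0, 25, ..., 225.
    for i := 0; i < 10; i++ {
        go parseUrls("https://movie.douban.com/top250?start="+strconv.Itoa(25*i), ch)
    }
    // Wait for all ten goroutines to signal completion.
    for i := 0; i < 10; i++ {
        <-ch
    }
    log.Printf("Took %s", time.Since(start))
}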

antchfx/htmlquery is a much better experience than lestrrat-go/libxml2: Find returns the list of matching nodes, and FindOne returns the first match.
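
One caveat, if I read the API correctly: FindOne returns nil when nothing matches, and calling InnerText on nil will panic, so on pages with irregular items you may want a guard. A defensive sketch of the loop body above:

url := htmlquery.FindOne(node, "./a/@href")
title := htmlquery.FindOne(node, `.//span[@class="title"]/text()`)
if url == nil || title == nil {
    continue // skip entries that don't match the expected structure
}
log.Println(strings.Split(htmlquery.InnerText(url), "/")[4], htmlquery.InnerText(title))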

Afterword

This crawler should give you a basic feel for these two libraries. There is also gopkg.in/xmlpath.v2, which I have not written about, mainly because it has not been updated for a long time (the last update was in 2015).

A quick note on gopkg.in: it is a package-management convention that "proxies" packages, in a standardized way, to the corresponding branch of the corresponding project on GitHub. See link 2 in the extended reading for details.

The xmlpath.v2 package is go-xmlpath/xmlpath on GitHub, branch v2.
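
In other words, for this package the mapping works out to:

❯ go get gopkg.in/xmlpath.v2   # fetches github.com/go-xmlpath/xmlpath at branch v2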

Of these libraries, I recommend antchfx/htmlquery; its interface is the nicest. I still need more real-world experience to judge performance and functionality, and I will write more later if I run into other issues.

Original address: strconv.com/posts/web-c…

Code address

The full code can be found at this address.

Extended reading

  1. www.w3school.com.cn/xpath/xpath…
  2. labix.org/gopkg.in