Python crawler engineers commonly use the BeautifulSoup library to extract data, and Golang has a corresponding library called soup. Since I prefer writing crawlers in Python, soup naturally came to mind, and this article is my experience trying it out.

Installing soup

Soup is a third-party library that requires manual installation:

❯ go get github.com/anaskhan96/soup
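If you are working inside a Go module, you can initialize the module first so that go get records the dependency in go.mod. The module path below is just a placeholder:

❯ go mod init example.com/douban-crawler
❯ go get github.com/anaskhan96/soup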

Using soup

As in the previous exercise, we define request headers. However, the soup library's Get method only accepts a URL parameter; headers and cookies are set separately through soup itself:

import (
    "fmt"
    "log"
    "strconv"
    "strings"
    "time"

    "github.com/anaskhan96/soup"
)

func fetch(url string) soup.Root {
    fmt.Println("Fetch Url", url)
    // Set request headers via the package-level soup.Headers map.
    soup.Headers = map[string]string{
        "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    }

    source, err := soup.Get(url)
    if err != nil {
        log.Fatal(err)
    }
    // Parse the raw HTML into a soup.Root document object.
    doc := soup.HTMLParse(source)
    return doc
}

The built-in net/http package is not used this time: soup supports setting headers (and cookies) directly through the package-level soup.Headers and soup.Cookies variables. soup.Get(url) then fetches the page source, and soup.HTMLParse parses it into a document object.
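soup also provides Header and Cookie functions for setting individual entries instead of assigning the whole map. A minimal sketch; the cookie name and value are made-up placeholders, not from this article:

// Sketch: set headers and cookies one entry at a time.
soup.Header("User-Agent", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
soup.Cookie("session", "placeholder-value") // hypothetical cookie, for illustration only

source, err := soup.Get(url)
if err != nil {
    log.Fatal(err)
}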

Next, look at the relevant HTML for a single entry in the Douban Movies Top 250 list:

<ol class="grid_view"> <li> <div class="item"> <div class="info"> <div class="hd"> <a Href = "https://movie.douban.com/subject/1292052/" class = "" > < span class =" title "> shawshank redemption < / span > < span class =" title "> & have spent /&nbsp; The Shawshank Redemption</span> <span class="other">&nbsp; /&nbsp; Black fly (Hong Kong)/exciting 1995 (Taiwan) < / span > < / a > < span class = "playable" > [can play] < / span > < / div > < / div > < / div > < / li >... </ol>Copy the code

It's the same requirement as before: get each entry's ID and title. This time the logic of parseUrls needs to be changed to the soup version:

func parseUrls(url string, ch chan bool) {
    doc := fetch(url)
    // Each movie entry lives in a div.hd inside ol.grid_view.
    for _, root := range doc.Find("ol", "class", "grid_view").FindAll("div", "class", "hd") {
        movieUrl, _ := root.Find("a").Attrs()["href"]
        title := root.Find("span", "class", "title").Text()
        // The subject ID is the fifth path segment of the movie URL.
        fmt.Println(strings.Split(movieUrl, "/")[4], title)
    }
    time.Sleep(2 * time.Second)
    ch <- true
}

As you can see, soup and goquery both name the method Find, but the arguments take a different form: you pass three strings, the tag name, the attribute name (class here), and the attribute value. Use FindAll when there can be more than one match (Find only finds the first). To get the value of an attribute, use the Attrs method and read it from the returned map.
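Note that if Find matches nothing, the returned soup.Root has its Error field set, and chaining Text or Attrs on it can fail. A defensive version of the attribute lookup might look like this sketch; the skip-on-error policy is my assumption:

// Sketch: check soup.Root's Error field before using a lookup result.
// Assumes this runs inside the for range loop over div.hd nodes above.
link := root.Find("a")
if link.Error != nil {
    log.Println("no <a> tag found:", link.Error)
    continue
}
movieUrl := link.Attrs()["href"]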

Text is again used to get the text. soup has no Each method like goquery's, so the loop has to be written manually in for range form.
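The driver isn't shown in this section, but given parseUrls's ch chan bool parameter and the strconv import, a main function along the lines of the previous exercises might look like this sketch. The 10 pages of 25 entries and the start query parameter match how Douban paginates the Top 250; the rest is an assumption, not the article's code:

// Sketch of a possible driver: fire one goroutine per page, then wait
// for all ten to signal completion on the channel.
func main() {
    start := time.Now()
    ch := make(chan bool)
    for i := 0; i < 10; i++ {
        go parseUrls("https://movie.douban.com/top250?start="+strconv.Itoa(25*i), ch)
    }
    for i := 0; i < 10; i++ {
        <-ch
    }
    elapsed := time.Since(start)
    fmt.Printf("Took %s\n", elapsed)
}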

Afterword

Through this crawler I've gained a basic understanding of the library, and overall I think soup is adequate for the job.

Code address

The full code can be found at this address.