Python crawler engineers commonly use the BeautifulSoup library to extract data, and Go has a counterpart called soup. Since I prefer writing crawlers in Python, soup naturally came to mind, so this article tries it out.
Installing soup
Soup is a third-party library that requires manual installation:
❯ go get github.com/anaskhan96/soup
Using soup
As in the previous exercise, we need to define request headers. soup's Get function only takes a URL as its argument, so everything else, such as headers and cookies, is set separately through the package-level soup.Headers and soup.Cookies variables:
import (
	"fmt"
	"log"
	"strconv"
	"strings"
	"time"

	"github.com/anaskhan96/soup"
)

// fetch downloads url with soup and returns the parsed document root.
func fetch(url string) soup.Root {
	fmt.Println("Fetch Url", url)
	// soup.Headers is a package-level map that soup applies to its requests.
	soup.Headers = map[string]string{
		"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
	}
	source, err := soup.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	doc := soup.HTMLParse(source)
	return doc
}
The built-in net/http package is not used directly this time. soup lets you set Headers (and Cookies) on the package itself, so custom values need no hand-built request. soup.Get(url) then fetches the page source, and soup.HTMLParse turns that source into a document object.
Next, look at the relevant HTML for a single entry on the Douban Movie Top 250 page:
<ol class="grid_view">
  <li>
    <div class="item">
      <div class="info">
        <div class="hd">
          <a href="https://movie.douban.com/subject/1292052/" class="">
            <span class="title">The Shawshank Redemption</span>
            <span class="title"> / The ...</span>
            ...
          </a>
        </div>
      </div>
    </div>
  </li>
  ....
</ol>
The requirement is the same as before: get each entry's ID and title. This time the logic of parseUrls needs to be rewritten for soup:
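func parseUrls(url string, ch chan bool) {
	doc := fetch(url)
	// Each <div class="hd"> inside <ol class="grid_view"> holds one movie.
	for _, root := range doc.Find("ol", "class", "grid_view").FindAll("div", "class", "hd") {
		// Attrs returns a map; the discarded second value reports whether href exists.
		movieUrl, _ := root.Find("a").Attrs()["href"]
		title := root.Find("span", "class", "title").Text()
		// The URL looks like https://movie.douban.com/subject/<id>/, so index 4 is the ID.
		fmt.Println(strings.Split(movieUrl, "/")[4], title)
	}
	time.Sleep(2 * time.Second)
	ch <- true // signal on the channel that this page is done
}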
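The parseUrls code above takes the ID as element 4 of the href split on "/", which panics if the href is missing or shorter than expected. As a hedged alternative, here is a small sketch that parses the URL and takes the last path segment instead. The helper name movieID is hypothetical and not part of the original code.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// movieID (hypothetical helper) returns the last non-empty path segment
// of a subject URL, e.g. the numeric ID in
// https://movie.douban.com/subject/1292052/.
// The second result is false when the URL cannot be parsed or has no
// path segment, instead of panicking on an out-of-range index.
func movieID(rawURL string) (string, bool) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", false
	}
	// Trim the leading and trailing slashes so the last element is the ID.
	segments := strings.Split(strings.Trim(u.Path, "/"), "/")
	id := segments[len(segments)-1]
	return id, id != ""
}

func main() {
	fmt.Println(movieID("https://movie.douban.com/subject/1292052/"))
	fmt.Println(movieID("")) // an empty href yields ok == false
}
```

Inside the loop, the caller could skip an entry when the second return value is false rather than aborting the whole page.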
As you can see, soup and goquery both name the method Find, but the arguments take a different form: soup's Find takes up to three strings, namely the tag name, the attribute name, and the attribute value. If there are multiple matches, use FindAll (Find returns only the first one). To read an attribute's value, call Attrs, which returns a map, and look the attribute up by name.
The text is again obtained with the Text method. Unlike goquery, soup has no Each method, so you have to write the loop yourself with for range.
Afterword
This crawler gave me a basic understanding of the library, and overall I think soup is adequate for this kind of work.
Code address
The full code can be found at this address.