When writing crawlers, you usually don't want to match and extract HTML content with hand-written regular expressions: they are hard to read and maintain. Python has many options for this; one library commonly used by developers is pyquery. Go has a corresponding library, goquery, which is a Go take on jQuery. Using jQuery-style CSS selector syntax, you can implement content matching and lookup in a very simple way.
Install goquery
Goquery is a third-party library that needs to be installed manually:
❯ go get github.com/PuerkitoBio/goquery
Create a document
The main structure exposed by goquery is goquery.Document, which is typically created in one of two ways:
doc, err := goquery.NewDocumentFromReader(reader io.Reader)
doc, err := goquery.NewDocument(url string)
The second method takes a URL directly, but since we usually customize the request quite a bit (adding headers, setting cookies, and so on), the first method is more common, and our code needs to change accordingly:
import ( "fmt" "log" "net/http" "strconv" "time" "github.com/PuerkitoBio/goquery" ) func fetch(url string) *goquery.Document { ... defer resp.Body.Close() doc, err := goquery.NewDocumentFromReader(res.Body) if err ! = nil { log.Fatal(err) } return docCopy the code
Where fetch previously returned resp.Body as a string, it now returns a doc of type *goquery.Document.
CSS selectors
This article does not cover selector syntax itself; if you are not familiar with it, see link 1 in the further reading section at the end of this article.
Let’s first look at the relevant HTML for a single entry on the Douban Movie Top 250 page:
<ol class="grid_view"> <li> <div class="item"> <div class="info"> <div class="hd"> <a Href = "https://movie.douban.com/subject/1292052/" class = "" > < span class =" title "> shawshank redemption < / span > < span class =" title "> / The </ SPAN > </span> </span> </span> </span> </span> </div> </div> </div> </li> .... </ol>Copy the code
The requirement is the same as before: get the entry’s ID and title. This time we change the logic of parseUrls to a version that uses goquery:
func parseUrls(url string, ch chan bool) {
doc := fetch(url)
doc.Find("ol.grid_view li").Find(".hd").Each(func(index int, ele *goquery.Selection) {
movieUrl, _ := ele.Find("a").Attr("href")
fmt.Println(strings.Split(movieUrl, "/")[4], ele.Find(".title").Eq(0).Text())
})
time.Sleep(2 * time.Second)
ch <- true
}
Doc.Find takes a CSS selector as its argument, and Find supports chained calls. The selector above means: find all li elements inside the ol whose class is "grid_view", then within them find the elements whose class is "hd". The result of Find is a list of matches, which is iterated with the Each method: you pass it a function that receives the index index and the child element ele, and the logic for extracting content lives in that function.
In the example above there are two spans with the class "title", so we take the first one with Eq(0); the Text method retrieves the element’s text content. The entry ID is obtained from the link to the entry’s page: Attr fetches the href attribute. Note that Attr returns two values: the attribute’s value and whether the attribute exists. Now that we have the ID and title, isn’t this much more readable and maintainable?
PS: At this point the goal of this crawler exercise has been achieved; getting more content is just a matter of writing more extraction logic.
Code
The full code can be found at this address.
Further reading
- www.w3school.com.cn/cssref/css_…
- www.itlipeng.cn/2017/04/25/…