I have always used Python to write crawlers, but recently I wanted to experience what writing a crawler in Golang feels like, hence this series. The page I want to crawl is the Douban Top250 page, and I chose it for three reasons:
- Douban's page markup is relatively standard
- Douban is fairly tolerant of crawler enthusiasts
- The Top250 page is simple, which makes it ideal for practice
Let’s look at the first version of the code.
I split the crawling code into two logical parts:
- Making the HTTP request
- Parsing the contents of the page
For HTTP requests, Golang doesn't need a third-party library; the standard library's built-in support is good enough:
```go
import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func fetch(url string) string {
	fmt.Println("Fetch Url", url)
	client := &http.Client{}
	req, _ := http.NewRequest("GET", url, nil)
	// Pretend to be Googlebot so the request looks like a search engine's.
	req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("Http get err:", err)
		return ""
	}
	if resp.StatusCode != 200 {
		fmt.Println("Http status code:", resp.StatusCode)
		return ""
	}
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("Read error", err)
		return ""
	}
	return string(body)
}
```
I put all of the URL-requesting logic in the fetch function, which also does some error handling. Two things are worth noting:
- A User-Agent is set in the header so the request looks like it comes from a search engine bot. If a site wants its content indexed by Google, it won't deny access to such a UA.
- The body of the resp has to be read with ioutil.ReadAll and then converted to a string with string(body); the sketch below checks both points.
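To see both points in action without hitting Douban, here is a minimal self-contained sketch; the local httptest server is my own scaffolding for illustration, not part of the crawler:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"net/http/httptest"
)

func main() {
	// A throwaway local server that echoes back the User-Agent it receives.
	ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.UserAgent())
	}))
	defer ts.Close()

	req, _ := http.NewRequest("GET", ts.URL, nil)
	req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
	resp, _ := http.DefaultClient.Do(req) // errors ignored for brevity
	defer resp.Body.Close()

	// Read the body the same way fetch does, then convert it to a string.
	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Println(string(body)) // the Googlebot UA, not Go's default "Go-http-client/1.1"
}
```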
Next comes the part that parses the page:
```go
import (
	"regexp"
	"strings"
)

func parseUrls(url string) {
	body := fetch(url)
	// Strip newlines so that .* in the patterns can match across what
	// were originally multiple lines.
	body = strings.Replace(body, "\n", "", -1)
	rp := regexp.MustCompile(`<div class="hd">(.*?)</div>`)
	titleRe := regexp.MustCompile(`<span class="title">(.*?)</span>`)
	idRe := regexp.MustCompile(`<a href="https://movie.douban.com/subject/(\d+)/"`)
	items := rp.FindAllStringSubmatch(body, -1)
	for _, item := range items {
		fmt.Println(idRe.FindStringSubmatch(item[1])[1], titleRe.FindStringSubmatch(item[1])[1])
	}
}
```
In this article we focus on parsing the page with the standard library's regular expression package, regexp. Note that strings.Replace(body, "\n", "", -1) is used to strip the newlines from the body, since otherwise the .* in the regular expressions would not match across lines. The FindAllStringSubmatch method returns all results matching the regular expression (as a slice), whereas FindStringSubmatch returns only the first match.
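To make the difference between those two methods concrete, here is a small self-contained sketch; the HTML snippet is invented for illustration and is not Douban's actual markup:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Two items shaped like the title spans we extract from the page.
	body := `<span class="title">A</span><span class="title">B</span>`
	titleRe := regexp.MustCompile(`<span class="title">(.*?)</span>`)

	// FindAllStringSubmatch returns every match; each element is a slice
	// where index 0 is the whole match and index 1 the first capture group.
	for _, m := range titleRe.FindAllStringSubmatch(body, -1) {
		fmt.Println(m[1]) // prints A, then B
	}

	// FindStringSubmatch stops at the first match.
	fmt.Println(titleRe.FindStringSubmatch(body)[1]) // prints A
}
```

As an aside, Go's regexp syntax also supports the (?s) flag, which makes . match newlines too, so a pattern like `(?s)<div class="hd">(.*?)</div>` would be an alternative to stripping the newlines up front.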
The Top250 listing is paginated, so the main function finally implements crawling all of the Top250 pages. In addition, to make comparisons with later improvements possible, we add logic to time how long the code runs:
```go
import (
	"strconv"
	"time"
)

func main() {
	start := time.Now()
	for i := 0; i < 10; i++ {
		parseUrls("https://movie.douban.com/top250?start=" + strconv.Itoa(25*i))
	}
	elapsed := time.Since(start)
	fmt.Printf("Took %s", elapsed)
}
```
To convert a number to a string in Golang, use strconv.Itoa; the correct page path for each page is spelled out from the different values of the start parameter, and a for loop completes the pagination.
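As a quick illustration of how those page URLs come out, a tiny standalone sketch (output shown in the comments):

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	// Each Top250 page shows 25 entries; start is the zero-based offset
	// of the first entry on that page.
	for i := 0; i < 3; i++ {
		fmt.Println("https://movie.douban.com/top250?start=" + strconv.Itoa(25*i))
	}
	// Output:
	// https://movie.douban.com/top250?start=0
	// https://movie.douban.com/top250?start=25
	// https://movie.douban.com/top250?start=50
}
```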
It runs very fast:
```
❯ go run crawler/doubanCrawler1.go
...
Took 1.454627547s
```
From the terminal output, we can see that we got the ID and title of each movie entry!
The code address
The full code can be found at this address.