Introduction to Colly

Colly is a powerful crawler framework written in Go. It offers a clean API, solid performance, automatic handling of cookies and sessions, and a flexible extension mechanism.

First, let's introduce the basics of Colly, and then walk through its usage and features with several examples: pulling GitHub Trending, pulling the Baidu novel hot list, and downloading pictures from the Unsplash website.

Quick start

The code in this article uses Go Modules.

Create directory and initialize:

$ mkdir colly && cd colly
$ go mod init github.com/darjun/go-daily-lib/colly

Install colly library:

$ go get -u github.com/gocolly/colly/v2

Use:

package main

import (
  "fmt"

  "github.com/gocolly/colly/v2"
)

func main() {
  c := colly.NewCollector(
    colly.AllowedDomains("www.baidu.com"),
  )

  c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    fmt.Printf("Link found: %q -> %s\n", e.Text, link)
    c.Visit(e.Request.AbsoluteURL(link))
  })

  c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL.String())
  })

  c.OnResponse(func(r *colly.Response) {
    fmt.Printf("Response %s: %d bytes\n", r.Request.URL, len(r.Body))
  })

  c.OnError(func(r *colly.Response, err error) {
    fmt.Printf("Error %s: %v\n", r.Request.URL, err)
  })

  c.Visit("http://www.baidu.com/")
}

Colly is easy to use:

First, call colly.NewCollector() to create a crawler object of type *colly.Collector. Since every web page contains many links to other pages, an unrestricted crawl may never stop, so we restrict crawling to pages under www.baidu.com by passing in the option colly.AllowedDomains("www.baidu.com").

We then call the c.OnHTML method to register an HTML callback for every a element that has an href attribute. Inside the callback we visit the URL from the href, which means we parse each crawled page and then continue to follow the links it contains.

The c.OnRequest() method registers a request callback, executed each time a request is sent; here it simply prints the request URL.

The c.OnResponse() method registers a response callback, executed each time a response is received; again it simply prints the URL and the response size.

The c.OnError() method registers an error callback, executed when a request fails; it simply prints the URL and the error.

Finally, we call c.Visit() to fetch the first page.

Run:

$ go run main.go
Visiting http://www.baidu.com/
Response http://www.baidu.com/: 303317 bytes
Link found: "Baidu Home Page" -> /
Link found: "Settings" -> javascript:;
Link found: "Login" -> https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5
Link found: "News" -> http://news.baidu.com
Link found: "hao123" -> https://www.hao123.com
Link found: "Map" -> http://map.baidu.com
Link found: "Live" -> https://live.baidu.com/
Link found: "Video" -> https://haokan.baidu.com/?sfrom=baidu-top
Link found: "Post" -> http://tieba.baidu.com.Copy the code

After Colly fetches a page, it parses it with goquery. It then matches the element selectors of the registered HTML callbacks, wraps each goquery.Selection into a colly.HTMLElement, and executes the callbacks.

colly.HTMLElement is a thin wrapper around goquery.Selection:

type HTMLElement struct {
  Name string
  Text string
  Request *Request
  Response *Response
  DOM *goquery.Selection
  Index int
}
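
Because the DOM field exposes the underlying goquery.Selection, you can always fall back to the full goquery API when the convenience helpers below are not enough. Here is a minimal sketch, assuming a made-up table.results selector (it also needs fmt, strings, and github.com/PuerkitoBio/goquery imported):

// The selector below is hypothetical, purely for illustration.
c.OnHTML("table.results", func(e *colly.HTMLElement) {
  // e.DOM is a *goquery.Selection, so any goquery method works here.
  e.DOM.Find("tr").Each(func(i int, s *goquery.Selection) {
    fmt.Println(i, strings.TrimSpace(s.Text()))
  })
})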

Beyond the raw DOM, HTMLElement also provides easy-to-use methods:

  • Attr(k string): returns attribute k of the current element; we used e.Attr("href") in the example above to get the href attribute;
  • ChildAttr(goquerySelector, attrName string): returns the attrName attribute of the first child element selected by goquerySelector;
  • ChildAttrs(goquerySelector, attrName string): returns the attrName attributes of all child elements selected by goquerySelector, as a []string;
  • ChildText(goquerySelector string): concatenates and returns the text content of the child elements selected by goquerySelector;
  • ChildTexts(goquerySelector string): returns the text content of each child element selected by goquerySelector, as a []string;
  • ForEach(goquerySelector string, callback func(int, *HTMLElement)): executes callback for each child element selected by goquerySelector;
  • Unmarshal(v interface{}): unmarshals the HTMLElement into a struct instance by tagging the struct fields with goquerySelector-format selectors.
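
For instance, several of these helpers often combine in one callback. A minimal sketch against a hypothetical page layout (all selectors here are made up):

c.OnHTML("div.article", func(e *colly.HTMLElement) {
  // ChildText joins the text of every matching child.
  title := e.ChildText("h2 > a")
  // ChildAttr reads an attribute from the first matching child.
  link := e.Request.AbsoluteURL(e.ChildAttr("h2 > a", "href"))
  // ChildAttrs collects the attribute from every matching child.
  tags := e.ChildAttrs("span.tag", "data-tag")
  fmt.Println(title, link, tags)
})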

These methods will be used frequently. Here are some examples to illustrate Colly's features and usage.

GitHub Trending

I previously wrote an API that pulls GitHub Trending; with Colly it becomes even easier:

package main

import (
  "fmt"
  "strconv"
  "strings"

  "github.com/gocolly/colly/v2"
)

type Repository struct {
  Author  string
  Name    string
  Link    string
  Desc    string
  Lang    string
  Stars   int
  Forks   int
  Add     int
  BuiltBy []string
}

func main() {
  c := colly.NewCollector(
    colly.MaxDepth(1),
  )

  repos := make([]*Repository, 0, 15)
  c.OnHTML(".Box .Box-row", func(e *colly.HTMLElement) {
    repo := &Repository{}

    // author & repository name
    authorRepoName := e.ChildText("h1.h3 > a")
    parts := strings.Split(authorRepoName, "/")
    repo.Author = strings.TrimSpace(parts[0])
    repo.Name = strings.TrimSpace(parts[1])

    // link
    repo.Link = e.Request.AbsoluteURL(e.ChildAttr("h1.h3 > a", "href"))

    // description
    repo.Desc = e.ChildText("p.pr-4")

    // language
    repo.Lang = strings.TrimSpace(e.ChildText("div.mt-2 > span.mr-3 > span[itemprop]"))

    // star & fork
    starForkStr := e.ChildText("div.mt-2 > a.mr-3")
    starForkStr = strings.Replace(strings.TrimSpace(starForkStr), ",", "", -1)
    parts = strings.Split(starForkStr, "\n")
    repo.Stars, _ = strconv.Atoi(strings.TrimSpace(parts[0]))
    repo.Forks, _ = strconv.Atoi(strings.TrimSpace(parts[len(parts)-1]))

    // add (stars gained today)
    addStr := e.ChildText("div.mt-2 > span.float-sm-right")
    parts = strings.Split(addStr, " ")
    repo.Add, _ = strconv.Atoi(parts[0])

    // built by
    e.ForEach("div.mt-2 > span.mr-3 img[src]", func(index int, img *colly.HTMLElement) {
      repo.BuiltBy = append(repo.BuiltBy, img.Attr("src"))
    })

    repos = append(repos, repo)
  })

  c.Visit("https://github.com/trending")

  fmt.Printf("%d repositories\n", len(repos))
  fmt.Println("first repository:")
  for _, repo := range repos {
    fmt.Println("Author:", repo.Author)
    fmt.Println("Name:", repo.Name)
    break
  }
}

We use ChildText to extract the author, repository name, language, star and fork counts, stars added today, and so on. We use ChildAttr to get the repository link, which is a relative path, and convert it into an absolute URL with e.Request.AbsoluteURL().

Run:

$ go run main.go
25 repositories
first repository:
Author: Shopify
Name: dawn

Baidu novel hot list

Inspecting the page in the browser, each hot-list entry is structured as follows:

  • each entry has its own div.category-wrap_iQLoo;
  • div.index_1Ew5p under the a element is the rank;
  • the content lives in div.content_1YWBm;
  • a.title_dIF3B inside it is the title;
  • there are two div.intro_1l0wp elements: the first is the author, the second is the type;
  • div.desc_3CTjT is the description.

From this we define the structure:

type Hot struct {
  Rank   string `selector:"a > div.index_1Ew5p"`
  Name   string `selector:"div.content_1YWBm > a.title_dIF3B"`
  Author string `selector:"div.content_1YWBm > div.intro_1l0wp:nth-child(2)"`
  Type   string `selector:"div.content_1YWBm > div.intro_1l0wp:nth-child(3)"`
  Desc   string `selector:"div.desc_3CTjT"`
}

The tags use CSS selector syntax; they are added so that we can call the HTMLElement.Unmarshal() method directly to populate a Hot object.

Then create the Collector object:

c := colly.NewCollector()

Register callback:

c.OnHTML("div.category-wrap_iQLoo".func(e *colly.HTMLElement) {
  hot := &Hot{}

  err := e.Unmarshal(hot)
  iferr ! =nil {
    fmt.Println("error:", err)
    return
  }

  hots = append(hots, hot)
})

c.OnRequest(func(r *colly.Request) {
  fmt.Println("Requesting:", r.URL)
})

c.OnResponse(func(r *colly.Response) {
  fmt.Println("Response:".len(r.Body))
})
Copy the code

OnHTML executes Unmarshal on each entry to generate a Hot object.

OnRequest/OnResponse simply outputs debugging information.

Then call c.Visit() to request the page:

err := c.Visit("https://top.baidu.com/board?tab=novel")
if err != nil {
  fmt.Println("Visit error:", err)
  return
}

Finally add some debug prints:

fmt.Printf("%d hots\n".len(hots))
for _, hot := range hots {
  fmt.Println("first hot:")
  fmt.Println("Rank:", hot.Rank)
  fmt.Println("Name:", hot.Name)
  fmt.Println("Author:", hot.Author)
  fmt.Println("Type:", hot.Type)
  fmt.Println("Desc:", hot.Desc)
  break
}
Copy the code

Run output:

Requesting: https://top.baidu.com/board?tab=novel
Response: 118083
30 hots
first hot:
Rank: 1
Name: The pearl of poison, inherit the blood of evil god, repair the power of evil god, a generation of evil god, rule the world! See more >

Unsplash

The background images for my official-account articles mostly come from Unsplash, which offers a wealth of free, high-quality pictures. One problem with the site is that it is slow to access, so now that we are learning crawlers, let's have a program download the pictures automatically.

The Unsplash homepage shows a grid of small preview images; clicking one of them opens a page that contains the full-size image.

With a single colly.Collector object, the OnHTML callbacks would have to tell these three layers of page structure apart (the img URL still has to be fetched at the end), which puts a considerable mental burden on the code. Fortunately, Colly supports multiple Collector objects, so we write it this way:

package main

import (
  "fmt"

  "github.com/gocolly/colly/v2"
)

func main() {
  c1 := colly.NewCollector()
  c2 := c1.Clone()
  c3 := c1.Clone()

  c1.OnHTML("figure[itemProp] a[itemProp]", func(e *colly.HTMLElement) {
    href := e.Attr("href")
    if href == "" {
      return
    }

    c2.Visit(e.Request.AbsoluteURL(href))
  })

  c2.OnHTML("div._1g5Lu > img[src]", func(e *colly.HTMLElement) {
    src := e.Attr("src")
    if src == "" {
      return
    }

    c3.Visit(src)
  })

  c1.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL)
  })

  c1.OnError(func(r *colly.Response, err error) {
    fmt.Println("Visiting", r.Request.URL, "failed:", err)
  })
}

We use three Collector objects: the first collects the image page links on the home page, the second visits those pages, and the third downloads the images. Note that Clone() copies the first Collector's configuration but not its callbacks, so each Collector registers its own. Above we registered the request and error callbacks on the first Collector.

The third Collector downloads the specific image content and saves it locally:

func main() {
  // ... omitted
  var count uint32
  c3.OnResponse(func(r *colly.Response) {
    fileName := fmt.Sprintf("images/img%d.jpg", atomic.AddUint32(&count, 1))
    err := r.Save(fileName)
    if err != nil {
      fmt.Printf("saving %s failed: %v\n", fileName, err)
    } else {
      fmt.Printf("saving %s success\n", fileName)
    }
  })

  c3.OnRequest(func(r *colly.Request) {
    fmt.Println("visiting", r.URL)
  })
}

We use atomic.AddUint32() to number the images, since the callbacks may run concurrently once we enable async mode below.

Run the program, and the downloaded images land in the images directory.

Asynchronous crawling

By default, Colly crawls pages synchronously: one after another, as the Unsplash program above does. This can take a long time, so Colly also supports asynchronous crawling. We simply pass the colly.Async(true) option when constructing the Collector:

c1 := colly.NewCollector(
  colly.Async(true),
)

However, because the crawl is now asynchronous, the program must wait for the Collectors to finish their work; otherwise main exits early:

c1.Wait()
c2.Wait()
c3.Wait()

Run again, much faster 😀.
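
Putting the two pieces together, here is a minimal, self-contained async skeleton; the URL is only a placeholder:

package main

import (
  "fmt"

  "github.com/gocolly/colly/v2"
)

func main() {
  c := colly.NewCollector(
    colly.Async(true), // callbacks may now run concurrently
  )

  c.OnResponse(func(r *colly.Response) {
    fmt.Println(r.Request.URL, len(r.Body))
  })

  // In async mode Visit() queues the request and returns immediately.
  for page := 1; page <= 3; page++ {
    c.Visit(fmt.Sprintf("https://example.com/?page=%d", page))
  }

  // Block until all queued requests have been processed.
  c.Wait()
}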

The second version

Scrolling down the Unsplash page, we see that the later images are loaded asynchronously. Watching the Network tab in Chrome while scrolling shows the request:

The request goes to the path /napi/photos with per_page and page parameters and returns a JSON array. So there is another way to do this:

Define a structure for each item, keeping only the fields we need:

type Item struct {
  Id     string
  Width  int
  Height int
  Links  Links
}

type Links struct {
  Download string
}

The JSON is then parsed in the OnResponse callback, and for each item we call the Visit() method of d, the Collector responsible for downloading images, on the item's download link:

c.OnResponse(func(r *colly.Response) {
  var items []*Item
  json.Unmarshal(r.Body, &items)
  for _, item := range items {
    d.Visit(item.Links.Download)
  }
})

To kick off the crawl, we pull 3 pages, 12 items per page (the same page size as the requests we observed):

for page := 1; page <= 3; page++ {
  c.Visit(fmt.Sprintf("https://unsplash.com/napi/photos?page=%d&per_page=12", page))
}

Run it and check the downloaded images.

Rate limiting

Sometimes too many concurrent requests will get us blocked by the site. This is where LimitRule comes in: it limits request rate and concurrency:

type LimitRule struct {
  DomainRegexp string
  DomainGlob string
  Delay time.Duration
  RandomDelay time.Duration
  Parallelism    int
}

The commonly used fields are Delay, RandomDelay, and Parallelism: the fixed delay between requests, an additional random delay, and the maximum concurrency. You must also specify which domains the rule applies to, via DomainRegexp or DomainGlob; the Limit() method returns an error if neither is set. In the example above:

err := c.Limit(&colly.LimitRule{
  DomainRegexp: `unsplash\.com`,
  RandomDelay:  500 * time.Millisecond,
  Parallelism:  12,
})
if err != nil {
  log.Fatal(err)
}

We set a random delay of at most 500ms between requests to the unsplash.com domain, with at most 12 concurrent requests.

Set the timeout

The http.Client used by Colly has a default timeout, which can be overridden with the Collector's WithTransport() method:

c.WithTransport(&http.Transport{
  Proxy: http.ProxyFromEnvironment,
  DialContext: (&net.Dialer{
    Timeout:   30 * time.Second,
    KeepAlive: 30 * time.Second,
  }).DialContext,
  MaxIdleConns:          100,
  IdleConnTimeout:       90 * time.Second,
  TLSHandshakeTimeout:   10 * time.Second,
  ExpectContinueTimeout: 1 * time.Second,
})
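
If all you need is an overall per-request deadline rather than a custom transport, the Collector also exposes a SetRequestTimeout() method; a one-line sketch:

// Abort any request that takes longer than 30 seconds in total.
c.SetRequestTimeout(30 * time.Second)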

Extensions

Colly provides several extension features in the extensions sub-package, the most commonly used being random User-Agents. Websites use the User-Agent header to identify whether a request came from a browser, and crawlers set this header to disguise themselves as browsers. It is easy to use:

import "github.com/gocolly/colly/v2/extensions"

func main(a) {
  c := colly.NewCollector()
  extensions.RandomUserAgent(c)
}

The random User-Agent implementation is simple: on each request it picks one of the predefined User-Agent generators at random and sets the header:

func RandomUserAgent(c *colly.Collector) {
  c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("User-Agent", uaGens[rand.Intn(len(uaGens))]())
  })
}

It is not hard to implement your own extension. For example, suppose we need to set a specific header on every request:

func MyHeader(c *colly.Collector) {
  c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("My-Header", "dj")
  })
}

Call MyHeader() with the Collector object:

MyHeader(c)

Conclusion

Colly is the most popular crawler framework in the Go world and supports a rich set of features. This article introduced the common ones with examples. Due to space limitations, some advanced features were not covered, such as queues and storage backends. If crawlers interest you, dig deeper.

If you find a fun and useful Go library, please open an issue on GitHub 😄

References

  1. GitHub: github.com/darjun/go-d…
  2. Go daily library: goquery: darjun.github.io/2020/10/11/…
  3. Implementing a GitHub Trending API in Go: darjun.github.io/2021/06/16/…
  4. Colly GitHub: github.com/gocolly/col…

About me

My blog: darjun.github.io

Welcome to follow my WeChat public account [GoUpUp]; let's learn together and make progress together ~