This article was first published at imagician.net/archives/93… . Welcome to my blog imagician.net/ for more.

**This article is an introductory tutorial on the basic relationship between crawlers and servers. Techniques such as headless browsers and JavaScript reverse engineering are not discussed.**

For crawler applications large and small, anti-crawling measures are a persistent problem. Sites put restrictions in place to prevent a simple program from mindlessly fetching huge numbers of pages, which would put enormous request pressure on the site.

**Note that** this article is about crawling public information, such as an article's title, author, and publication date, not private data or paid digital products. Websites protect valuable digital products with much more sophisticated methods to keep crawlers from stealing them. Not only is that kind of information hard to crawl, it shouldn't be crawled.

The reason a site applies anti-crawling measures even to public content is that it expects its visitors to be friendly "human beings" who browse from page to page, jump around, log in, type, and refresh. A machine, by contrast, is like a "ghost" hammering away at an Ajax interface request after request, with no login and no context, increasing server pressure along with traffic, bandwidth, and storage overhead.

Take the anti-crawling at Bilibili (Station B) as an example:

package main

import (
	"github.com/zhshch2002/goribot"
	"os"
	"strings"
)

func main() {
	s := goribot.NewSpider(goribot.SpiderLogError(os.Stdout))
	var h goribot.CtxHandlerFun
	h = func(ctx *goribot.Context) {
		// If the response doesn't look like a normal video page, record an error and queue more requests.
		if !strings.Contains(ctx.Resp.Text, "Chronological order") {
			ctx.AddItem(goribot.ErrorItem{
				Ctx: ctx,
				Msg: "",
			})
			ctx.AddTask(goribot.GetReq("https://www.bilibili.com/video/BV1tJ411V7eg"), h)
			ctx.AddTask(goribot.GetReq("https://www.bilibili.com/video/BV1tJ411V7eg"), h)
			ctx.AddTask(goribot.GetReq("https://www.bilibili.com/video/BV1tJ411V7eg"), h)
			ctx.AddTask(goribot.GetReq("https://www.bilibili.com/video/BV1tJ411V7eg"), h)
		}
	}
	s.AddTask(goribot.GetReq("https://www.bilibili.com/video/BV1tJ411V7eg"), h)
	s.Run()
}

Running the above code keeps visiting www.bilibili.com/video/BV1tJ… over and over. Thanks to Goribot's built-in error logging, I could see that Bilibili quickly banned me with HTTP 403 Access Forbidden.

Sorry for picking on the poor little site again; I'll go renew my premium membership to make up for it. Please don't come after me ;-D

Intrusive anti-crawling measures

Many sites display content that is itself their product and carries value. Such sites set extra parameters (such as tokens) to identify machines more accurately.

For example, a site's Ajax requests may carry a Token, a Signature, and a browser identity set in the Cookie.

Such techniques effectively declare that the information is not meant to be crawled, and they are not covered in this article.

Observe “etiquette”

net/http and Goribot are used as the main examples in this post, the latter because I wrote that library.

Goribot provides many tools and is a lightweight crawler framework. See the documentation for details.

go get -u github.com/zhshch2002/goribot

Abide by robots.txt

robots.txt is a plain text file stored in the root directory of a website (that is, at /robots.txt). It describes which pages spiders may and may not crawl. Note that this is only a convention: robots.txt has no enforcement power of its own.
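As a quick illustration, a robots.txt might look like this (a made-up example; each User-agent block applies to crawlers announcing that name, and * matches everyone else):

User-agent: Goribot
Disallow: /search

User-agent: *
Disallow: /private/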

That said, a crawler that ignores robots.txt and visits disallowed pages clearly stands out, even when those pages are not its actual target and are only reached incidentally. Pages restricted by robots.txt tend to be more sensitive, since they may be important pages of the site.

We restrict our crawlers from accessing those pages to avoid triggering certain rules.

Support for robots.txt in Goribot uses github.com/slyrz/robot… .

s := goribot.NewSpider(
    goribot.RobotsTxt("https://github.com", "Goribot"),
)

Here we create a spider and load the robots.txt plugin. "Goribot" is the crawler's name; a robots.txt file can define different rules for crawlers with different names, and this parameter corresponds to that name. "https://github.com" is the address from which robots.txt is fetched; as mentioned above, robots.txt lives in the website's root directory and its scope covers only pages under the same host, so only the root URL is needed here.

Control concurrency and rate

Imagine you write a crawler that visits a single page and parses its HTML, then drop it into an infinite loop that keeps spawning new threads. Well, that sounds great.

From the website server's point of view, a single IP suddenly starts sending requests at very high frequency, and the traffic and bandwidth keep growing, 3 Gbps?! Is this a visit or a DDoS? The IP gets banned without hesitation.

After that, all your crawler collects is a pile of HTTP 403 Access Forbidden responses.

Of course, the above is an exaggerated example; hardly anyone has that much bandwidth at home, and nobody actually writes a crawler that way.

Controlling request concurrency and adding delays greatly reduces the strain on the server, even if crawling becomes slower. After all, we are here to collect data, not to bring the site down.
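For a plain net/http crawler, a minimal way to cap both rate and concurrency is a ticker plus a semaphore channel. This is just a sketch under assumed numbers; the URLs, the 2-requests-per-second rate, and the concurrency of 2 are placeholders:

package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

func main() {
	// Hypothetical list of pages to fetch.
	urls := []string{"https://httpbin.org/get", "https://httpbin.org/ip"}

	rate := time.Tick(500 * time.Millisecond) // roughly 2 requests per second
	sem := make(chan struct{}, 2)             // at most 2 requests in flight
	var wg sync.WaitGroup

	for _, u := range urls {
		<-rate            // wait for the next tick before launching a request
		sem <- struct{}{} // acquire a concurrency slot
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			resp, err := http.Get(u)
			if err != nil {
				fmt.Println("request failed:", err)
				return
			}
			resp.Body.Close()
			fmt.Println(u, resp.Status)
		}(u)
	}
	wg.Wait()
}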

In Goribot you can set it like this:

s := goribot.NewSpider(
    goribot.Limiter(false, &goribot.LimitRule{
        Glob: "httpbin.org",
        Rate: 2, // request rate limit (at most 2 requests per second under this host; extra requests block and wait)
    }),
)

Limiter is a relatively sophisticated extension in Goribot that can control request rate, concurrency, host whitelists, and random delays. Please refer to the documentation for details.

Technical measures

The site assumes that every requester should behave like a human and uses features of non-human behavior as a means of detection. We can therefore make our programs mimic human (and browser) behavior to avoid being flagged.

UA

Crawler developers are no strangers to the UA, that is, the User-Agent. If you visit GitHub using Chrome, for example, the User-Agent in the HTTP request is filled in by Chrome and sent to the site's server. UA literally means "user agent", i.e. what tool the user uses to access the website. (After all, users can't write HTTP messages by hand, except developers ;-D)

Websites can weed out machine requests simply by inspecting the UA. Go's native net/http package, for example, automatically sets a UA indicating that the request comes from a Go program, and many sites filter out such requests.

In Go's native net/http package, you can set the UA like this (the "User-Agent" header name is case-insensitive):

r, _ := http.NewRequest("GET", "https://github.com", nil)
r.Header.Set("User-Agent", "Goribot")

In Goribot you can set a request's UA with a chained call:

goribot.GetReq("https://github.com").SetHeader("User-Agent", "Goribot")

Setting the UA manually every time is tedious, and you have to make up a browser-like UA each time to pretend you are a browser. So there is a plugin that sets a random UA automatically:

s := goribot.NewSpider(
    goribot.RandomUserAgent(),
)
Copy the code

Referer

Referer is a request header that means "which URL did I jump from to reach this request", in short, "where do I come from". If your application keeps sending requests with a missing or empty Referer, the server will wonder, "Hey, where did you come from, out of thin air? Get lost!" and hand you HTTP 403 Access Forbidden.

In Golang's native net/http package, you can set the Referer like this:

r, _ := http.NewRequest("GET", "https://github.com", nil)
r.Header.Set("Referer", "https://www.google.com")

In Goribot you can install the Referer auto-fill plugin, which fills each new request's Referer with the address of the previous request:

s := goribot.NewSpider(
    goribot.RefererFiller(),
)
Copy the code

Cookie

Cookies should be familiar to everyone. Websites use cookies to store login state and account information. A cookie is essentially key-value data that the website server stores on the client browser. For more background on cookies, consult Baidu or Google.

A Goribot crawler is created with a cookie jar, which automatically manages cookies while the crawler runs. We can also set cookies on requests to mimic a person who has logged in through a browser.

Using Go's native net/http with a cookie jar enabled, you can set a login cookie like this:

package main

// Code from https://studygolang.com/articles/10842, thank you very much

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"net/http/cookiejar"
	"net/url"
	"time"
)

func main() {
	// Init jar
	j, _ := cookiejar.New(nil)
	// Create a client that uses the jar
	client := &http.Client{Jar: j}

	// Put a login cookie into the jar
	var clist []*http.Cookie
	clist = append(clist, &http.Cookie{
		Name:    "BDUSS",
		Domain:  ".baidu.com",
		Path:    "/",
		Value:   "Cookie value XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
		Expires: time.Now().AddDate(1, 0, 0),
	})
	urlX, _ := url.Parse("http://zhanzhang.baidu.com")
	j.SetCookies(urlX, clist)

	fmt.Printf("Jar cookie : %v\n", j.Cookies(urlX))

	// Build the request
	req, err := http.NewRequest("GET", "http://zhanzhang.baidu.com", nil)
	if err != nil {
		fmt.Println("Failure : ", err)
		return
	}

	// Fetch Request
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("Failure : ", err)
		return
	}
	defer resp.Body.Close()

	respBody, _ := ioutil.ReadAll(resp.Body)

	// Display Results
	fmt.Println("response Status : ", resp.Status)
	fmt.Println("response Body : ", string(respBody))
	fmt.Printf("response Cookies :%v\n", resp.Cookies())
}

In Goribot you can do this:

s.AddTask(goribot.GetReq("https://www.bilibili.com/video/BV1tJ411V7eg").AddCookie(&http.Cookie{
    Name:    "BDUSS",
    Value:   "Cookie value XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    Expires: time.Now().AddDate(1, 0, 0),
}), handlerFunc)

Then, when s.Run() executes, the request will carry this cookie, and subsequent cookies will be maintained by the cookie jar.
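Putting the pieces together, a relatively polite Goribot spider might be configured like the sketch below. It simply combines the extensions shown in this article; the target URL, host glob, rate, and handler body are placeholders:

s := goribot.NewSpider(
	goribot.RobotsTxt("https://www.bilibili.com", "Goribot"), // obey robots.txt, announcing the name "Goribot"
	goribot.Limiter(false, &goribot.LimitRule{
		Glob: "www.bilibili.com",
		Rate: 2, // at most 2 requests per second to this host
	}),
	goribot.RandomUserAgent(), // random browser UA for each request
	goribot.RefererFiller(),   // fill Referer from the previous request
)
s.AddTask(goribot.GetReq("https://www.bilibili.com/video/BV1tJ411V7eg"), func(ctx *goribot.Context) {
	// Handle the response here.
})
s.Run()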