Original link: strconv.com/posts/web-c…

In the last article, I wrote a crawler in Go, but it runs serially and is inefficient. This article makes it concurrent. Since the program only crawls 10 pages, it finishes in about 1s. To make the comparison clearer, we can add a Sleep to the previous doubancrawler1.go so that it runs "slower":

func parseUrls(url string) {
    ...
    time.Sleep(2 * time.Second)
}

Run this way, the program should need a little over 21s to finish (10 pages with a 2s Sleep each, plus the original 1s or so). Let's run it and see:

❯ go run doubanCrawler2.go
Took 21.315744555s

That is slow enough. Now let's make it faster.

Incorrect use of goroutine

Let's switch to goroutines, the concurrency mechanism built into Go. Using a goroutine is very convenient: just put the go keyword in front of a function call. Here is a first version:

func main() {
	start := time.Now()
	for i := 0; i < 10; i++ {
		go parseUrls("https://movie.douban.com/top250?start=" + strconv.Itoa(25*i))
	}
	elapsed := time.Since(start)
	fmt.Printf("Took %s", elapsed)
}

All we did was add the go keyword in front of the call to parseUrls. But this does not work: nothing gets fetched, because main returns (and the whole program exits) as soon as the goroutines have been started, before any of them has finished. What to do? One option is to Sleep for longer than the slowest goroutine takes, which ensures that all goroutines can run to completion (doubancrawler3.go):

func main() {
    start := time.Now()
    for i := 0; i < 10; i++ {
        go parseUrls("https://movie.douban.com/top250?start=" + strconv.Itoa(25*i))
    }
    time.Sleep(4 * time.Second)
    elapsed := time.Since(start)
    fmt.Printf("Took %s", elapsed)
}

A 4-second Sleep is added after the for loop. If all the goroutines finish within 3 seconds (the fixed 2-second Sleep plus about 1 second of actual work), then one second of that Sleep is wasted! Run it:

❯ go run doubancrawler3.go
...
Took 4.000849896s  # This time it takes roughly 4s
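Both problems can be reproduced without touching the network. The following is only a minimal sketch: launchAndSleep is a hypothetical helper, the goroutines just sleep instead of calling parseUrls, and the durations are assumptions picked for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// launchAndSleep is a hypothetical helper: it starts n goroutines that each
// "work" for the given duration, sleeps for wait, and then reports how many
// goroutines had finished and how long the caller was blocked.
func launchAndSleep(n int, work, wait time.Duration) (int32, time.Duration) {
	var finished int32 // per-call counter, updated atomically
	start := time.Now()
	for i := 0; i < n; i++ {
		go func() {
			time.Sleep(work) // stand-in for the real fetch
			atomic.AddInt32(&finished, 1)
		}()
	}
	time.Sleep(wait)
	return atomic.LoadInt32(&finished), time.Since(start)
}

func main() {
	// No waiting: the caller moves on before the goroutines are done.
	done, _ := launchAndSleep(10, 50*time.Millisecond, 0)
	fmt.Println("finished without waiting:", done)

	// Generous Sleep: everything finishes, but the caller is blocked for
	// the full 200ms even though the work needed only about 50ms.
	done, elapsed := launchAndSleep(10, 50*time.Millisecond, 200*time.Millisecond)
	fmt.Println("finished after sleeping:", done, "elapsed:", elapsed)
}
```

The first call will normally report 0 finished goroutines, which is the "nothing gets fetched" case; the second reports all 10, but at the cost of a fixed wait that has to be guessed in advance.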

The correct use of goroutine

So how should goroutines be used? Is there something like the join method of Python's multiprocessing/threading modules, which waits for the child processes/threads to finish? Of course there is: goroutines can communicate with each other through channels, sending data into one end and receiving it from the other. On an unbuffered channel, every send must be paired with a receive, otherwise it blocks:

func parseUrls(url string, ch chan bool) {
    ...
    ch <- true
}

func main() {
    start := time.Now()
    ch := make(chan bool)
    for i := 0; i < 10; i++ {
        go parseUrls("https://movie.douban.com/top250?start="+strconv.Itoa(25*i), ch)
    }

    for i := 0; i < 10; i++ {
        <-ch
    }

    elapsed := time.Since(start)
    fmt.Printf("Took %s", elapsed)
}

In this version, parseUrls still runs in a goroutine, but note that its signature has changed to accept a channel parameter ch. When the function's logic finishes, it sends a boolean value on ch.

In main, a second for loop uses <-ch to wait for those values (here they only confirm that a task is complete). This gives a much better concurrent version:

❯ go run doubancrawler4.go
...
Took 2.450826901s  # Much better than the previous version with its hard-coded 4s!
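The completion-signal pattern can also be tried in isolation. The sketch below is not the crawler itself: fakeFetch is a hypothetical stand-in for parseUrls that sleeps instead of making a request, and sending the page offset on the channel instead of a bool is my own variation, to show that the channel can carry a result as well as a "done" signal.

```go
package main

import (
	"fmt"
	"time"
)

// fakeFetch is a hypothetical stand-in for parseUrls: it sleeps instead of
// making an HTTP request, then reports the offset it handled on ch.
func fakeFetch(start int, ch chan<- int) {
	time.Sleep(50 * time.Millisecond)
	ch <- start
}

// crawlAll starts one goroutine per page and waits for all of them by
// receiving exactly one value per goroutine.
func crawlAll(pages int) []int {
	ch := make(chan int) // unbuffered: each send blocks until it is received
	for i := 0; i < pages; i++ {
		go fakeFetch(25*i, ch)
	}
	got := make([]int, 0, pages)
	for i := 0; i < pages; i++ {
		got = append(got, <-ch) // arrival order is not deterministic
	}
	return got
}

func main() {
	begin := time.Now()
	got := crawlAll(10)
	fmt.Println("pages done:", len(got), "elapsed:", time.Since(begin))
}
```

The number of receives has to match the number of sends: with fewer receives some goroutines stay blocked on their send, and with more receives main blocks forever.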

sync.WaitGroup

Another good solution is sync.WaitGroup. Our program only prints the content it fetches, and the goroutines do not need to hand anything back, so a WaitGroup fits well: it waits for a set of concurrent operations to complete:

import (
	...
	"sync"
)
...
func main() {
	start := time.Now()
	var wg sync.WaitGroup
	wg.Add(10)

	for i := 0; i < 10; i++ {
		// Pass i as an argument so each goroutine gets its own copy.
		go func(i int) {
			defer wg.Done()
			parseUrls("https://movie.douban.com/top250?start="+strconv.Itoa(25*i))
		}(i)
	}

	wg.Wait()

	elapsed := time.Since(start)
	fmt.Printf("Took %s", elapsed)
}

First, we tell the WaitGroup how many goroutines to wait for by calling wg.Add. The total number of pages is 10, so we can hard-code it here.

Also, the defer keyword is used to call wg.Done, which guarantees that the WaitGroup is notified when the goroutine's closure exits. Since both wg.Done and parseUrls have to run, the go keyword cannot be applied to parseUrls directly; the two calls need to be wrapped in a function.

(Thanks to @bhblinux for pointing this out.) Note, however, that the loop variable i has to be passed to the closure as an argument of func. Otherwise all goroutines share the same i and may read the value from a later iteration, or the value left after the last one:

// Incorrect code 👇
for i := 0; i < 10; i++ {
    go func() {
        defer wg.Done()
        parseUrls("https://movie.douban.com/top250?start="+strconv.Itoa(25*i))
    }()
}

❯ go run crawler/doubanCrawler5.go
Fetch Url https://movie.douban.com/top250?start=75
Fetch Url https://movie.douban.com/top250?start=250
Fetch Url https://movie.douban.com/top250?start=250
Fetch Url https://movie.douban.com/top250?start=250
Fetch Url https://movie.douban.com/top250?start=250
Fetch Url https://movie.douban.com/top250?start=250
Fetch Url https://movie.douban.com/top250?start=250
Fetch Url https://movie.douban.com/top250?start=250
Fetch Url https://movie.douban.com/top250?start=250
Fetch Url https://movie.douban.com/top250?start=200

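Looking at the code, the last iteration of the loop runs with i equal to 9, so the largest start should be 225. So why 250? Because the final i++ still executes: the loop body does not run again, but the value of i has already changed to 10, and every goroutine that reads i after that point sees 10.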
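The arithmetic behind 250 can be checked with a small sketch. lastOffset is a hypothetical helper that uses the same loop header as the crawler, except that i is declared outside the loop so its final value can be inspected afterwards.

```go
package main

import "fmt"

// lastOffset runs the same loop header as the crawler, but declares i
// outside the loop so that its value after the loop can be observed.
func lastOffset() (lastInBody, afterLoop int) {
	var i int
	for i = 0; i < 10; i++ {
		lastInBody = 25 * i // the last value seen inside the body is 25*9
	}
	// The final i++ has run and the condition i < 10 has failed, so i is 10.
	return lastInBody, 25 * i
}

func main() {
	inBody, after := lastOffset()
	fmt.Println(inBody, after) // prints "225 250"
}
```

As far as I know, since Go 1.22 a variable declared in the for statement itself (for i := 0; ...) is created anew for each iteration, so on those versions the incorrect code above should no longer print 250. Passing i as an argument behaves the same on every version and makes the intent explicit.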

Used this way, a WaitGroup is equivalent to a goroutine-safe counter: calling Add increases the count and calling Done decreases it. Calling Wait blocks until the counter drops back to zero. This also gives us concurrency while waiting for all goroutines to finish:

❯ go run doubancrawler5.go
...
Took 2.382876529s  # About the same as the channel version!
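For experimenting with the counter behaviour without hitting the site, here is a self-contained sketch. collectOffsets is a hypothetical function, the fetch is replaced by a short Sleep, and the mutex-protected slice is my own addition so that the result can be inspected. It also calls wg.Add(1) once per goroutine, which is a common alternative to hard-coding the total when the number of tasks is not known up front.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// collectOffsets starts one goroutine per page, passes i as an argument so
// that each goroutine works on its own copy, and waits with a WaitGroup.
func collectOffsets(pages int) []int {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex // protects offsets
		offsets []int
	)
	for i := 0; i < pages; i++ {
		wg.Add(1) // register the goroutine before starting it
		go func(i int) {
			defer wg.Done()
			time.Sleep(50 * time.Millisecond) // stand-in for the real fetch
			mu.Lock()
			offsets = append(offsets, 25*i)
			mu.Unlock()
		}(i)
	}
	wg.Wait()          // blocks until the counter is back to zero
	sort.Ints(offsets) // completion order is not deterministic
	return offsets
}

func main() {
	fmt.Println(collectOffsets(10))
	// prints "[0 25 50 75 100 125 150 175 200 225]"
}
```

Every offset from 0 to 225 appears exactly once and 250 never shows up, because each goroutine received its own copy of i.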

Afterword

Well, that's all for this article.

Code address

The full code can be found at this address.