In the last article, I wrote a crawler in Go, but it runs serially and is inefficient. This article makes it concurrent. Because the program only crawls 10 pages, it finishes in about 1s. To make the comparison clearer, we can add a Sleep call to the previous doubancrawler1.go so that it runs "slower":
func parseUrls(url string) {
	...
	time.Sleep(2 * time.Second)
}
With this change, the program should need a little over 21s to run: 10 pages at 2s of Sleep each, plus about 1s of actual work. Let's run it and see:
❯ go run doubancrawler2.go
...
Took 21.315744555s
That is noticeably slow. Now let's make it faster.
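The 21-second estimate is simple arithmetic: 10 pages, each paying a 2-second Sleep one after another, plus roughly 1 second of real work. As a rough, scaled-down sketch of that serial behavior (the `fetchPage` and `runSerial` names are hypothetical stand-ins rather than code from the crawler, and the network request is replaced by a short sleep):

```go
package main

import (
	"fmt"
	"time"
)

// fetchPage is a hypothetical stand-in for parseUrls: it only sleeps
// to simulate the time spent on one page.
func fetchPage(delayMs int) {
	time.Sleep(time.Duration(delayMs) * time.Millisecond)
}

// runSerial processes the pages one after another and returns the
// elapsed wall-clock time in milliseconds.
func runSerial(pages, delayMs int) int64 {
	start := time.Now()
	for i := 0; i < pages; i++ {
		fetchPage(delayMs)
	}
	return time.Since(start).Milliseconds()
}

func main() {
	// 10 pages at 20ms each: the total is at least 10 * 20 = 200ms,
	// just as 10 pages at 2s each cost at least 20s in the crawler.
	fmt.Printf("serial run took about %dms\n", runSerial(10, 20))
}
```

Because every page waits for the previous one, the delays add up; this is the cost that running the pages concurrently removes.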
Incorrect use of goroutine
Let's switch to goroutines, the concurrency mechanism built into Go. Using a goroutine is very convenient: just put the go keyword in front of a function call. Here is a first version:
func main() {
	start := time.Now()
	for i := 0; i < 10; i++ {
		go parseUrls("https://movie.douban.com/top250?start=" + strconv.Itoa(25*i))
	}
	elapsed := time.Since(start)
	fmt.Printf("Took %s", elapsed)
}
All we did was add the go keyword in front of the parseUrls call. But this version is wrong: it fetches nothing, because main returns right after spawning the goroutines, and the whole program exits before any of them has finished. What can we do? One option is to Sleep for longer than the slowest goroutine takes, so that every goroutine gets to finish (doubancrawler3.go):
func main() {
	start := time.Now()
	for i := 0; i < 10; i++ {
		go parseUrls("https://movie.douban.com/top250?start=" + strconv.Itoa(25*i))
	}
	time.Sleep(4 * time.Second)
	elapsed := time.Since(start)
	fmt.Printf("Took %s", elapsed)
}
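We added a 4-second Sleep after the for loop. If all the goroutines finish within 3 seconds (2 seconds of fixed Sleep plus about 1 second of actual work), then one second of that Sleep is wasted, and if any goroutine took longer than 4 seconds its work would be cut off. Run it: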
❯ go run doubancrawler3.go
...
Took 4.000849896s # approximately 4s this time
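Both problems can be reproduced without touching the network. The sketch below is an illustration under stated assumptions, not the crawler itself: `finishedAfter` is a hypothetical helper, each goroutine just sleeps to simulate work, and an atomic counter records how many of them finished by the time the caller stops waiting.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// finishedAfter starts n goroutines that each "work" for workMs
// milliseconds, waits waitMs milliseconds, and reports how many
// goroutines had finished by then. It is a hypothetical helper.
func finishedAfter(n, workMs, waitMs int) int {
	var done int64
	for i := 0; i < n; i++ {
		go func() {
			time.Sleep(time.Duration(workMs) * time.Millisecond)
			atomic.AddInt64(&done, 1)
		}()
	}
	time.Sleep(time.Duration(waitMs) * time.Millisecond)
	return int(atomic.LoadInt64(&done))
}

func main() {
	// No waiting at all: like the first version, nothing has finished.
	fmt.Println("no wait:", finishedAfter(10, 50, 0))
	// Waiting much longer than the work takes: everything finishes,
	// but the extra waiting time is wasted.
	fmt.Println("long wait:", finishedAfter(10, 20, 500))
}
```

A wait that is too short silently loses work, and one that is too long wastes time, which is why an explicit completion signal is preferable to guessing a duration.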
The correct use of goroutine
So what is the right way to use goroutines? Is there something like the join method of Python's processes and threads, which waits for the child process or thread to finish? Of course there is. Goroutines can communicate with each other through channels: data is sent into one end and received from the other. On an unbuffered channel, every send must be paired with a receive; otherwise the side that is waiting blocks:
func parseUrls(url string, ch chan bool) {
	...
	ch <- true
}

func main() {
	start := time.Now()
	ch := make(chan bool)
	for i := 0; i < 10; i++ {
		go parseUrls("https://movie.douban.com/top250?start="+strconv.Itoa(25*i), ch)
	}
	for i := 0; i < 10; i++ {
		<-ch
	}
	elapsed := time.Since(start)
	fmt.Printf("Took %s", elapsed)
}
In this version, parseUrls still runs in goroutines, but note that its signature has changed to accept a channel parameter, ch. When the function's logic finishes, it sends a Boolean value to ch.
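Because ch is created with make(chan bool) and no capacity, it is unbuffered, so that final send only completes once main is ready to receive. A minimal sketch of this pairing rule, using a hypothetical `trySend` helper that attempts a send without blocking:

```go
package main

import "fmt"

// trySend attempts a non-blocking send on ch and reports whether the
// value was accepted. It is a hypothetical helper for illustration.
func trySend(ch chan bool) bool {
	select {
	case ch <- true:
		return true
	default:
		// A plain `ch <- true` would block at this point.
		return false
	}
}

func main() {
	unbuffered := make(chan bool)
	// No goroutine is receiving, so the send cannot proceed.
	fmt.Println("unbuffered, no receiver:", trySend(unbuffered))

	buffered := make(chan bool, 1)
	// A buffer slot is free, so the send succeeds without a receiver.
	fmt.Println("buffered, first send:", trySend(buffered))
	// The buffer is now full, so a second send would block again.
	fmt.Println("buffered, second send:", trySend(buffered))
}
```

In the crawler this blocking is harmless, since main performs exactly one receive for every goroutine it starts.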
In main, a second for loop uses <-ch to wait for each value to arrive (here the value only confirms that a task is complete). This gives us a much better concurrent solution:
❯ go run doubancrawler4.go
...
Took 2.450826901s # much better than the hard-coded 4s!
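A channel can carry results as well as completion signals. As a sketch of that idea (assuming a hypothetical `fetchTitle` in place of the real page parsing, with the network request replaced by a short sleep), each goroutine below sends one string, and the caller receives exactly as many values as it started goroutines:

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// fetchTitle is a hypothetical stand-in for parseUrls: it pretends to
// fetch a page and sends a made-up result on ch when it is done.
func fetchTitle(start int, ch chan<- string) {
	time.Sleep(10 * time.Millisecond) // simulated network latency
	ch <- "page starting at " + strconv.Itoa(start)
}

// collect starts one goroutine per page and gathers every result.
func collect(pages int) []string {
	ch := make(chan string)
	for i := 0; i < pages; i++ {
		go fetchTitle(25*i, ch)
	}
	results := make([]string, 0, pages)
	// Receive once per goroutine; each receive unblocks one sender.
	for i := 0; i < pages; i++ {
		results = append(results, <-ch)
	}
	return results
}

func main() {
	for _, r := range collect(10) {
		fmt.Println(r) // arrival order depends on scheduling
	}
}
```

The number of receives has to match the number of sends: one receive too many would block forever, and one too few would leave a goroutine stuck on its send.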
sync.WaitGroup
Another good solution is sync.WaitGroup. Our program simply prints what it scrapes and does not need to pass anything back, so WaitGroup is a good fit: it waits for a set of concurrent operations to complete:
import (
	...
	"sync"
)
...
func main() {
	start := time.Now()
	var wg sync.WaitGroup
	wg.Add(10)
	for i := 0; i < 10; i++ {
		// Pass the URL as an argument so each goroutine gets its own copy
		// (before Go 1.22, closures shared a single loop variable i).
		go func(url string) {
			defer wg.Done()
			parseUrls(url)
		}("https://movie.douban.com/top250?start=" + strconv.Itoa(25*i))
	}
	wg.Wait()
	elapsed := time.Since(start)
	fmt.Printf("Took %s", elapsed)
}
We start by telling the WaitGroup how many goroutines to wait for with wg.Add. There are 10 pages in total, so we can hard-code the number here.
Also, the defer keyword is used to call wg.Done, which guarantees that the WaitGroup is notified before the goroutine's closure exits. Because both wg.Done and parseUrls have to run, we cannot apply the go keyword to parseUrls directly; the calls need to be wrapped in an anonymous function.
Used this way, WaitGroup acts as a concurrency-safe counter: calling Add increases the count, and calling Done decreases it. Calling Wait blocks until the counter returns to zero. This also gives us concurrency while waiting for all goroutines to finish:
❯ go run doubancrawler5.go
...
Took 2.382876529s
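When the number of tasks is not known in advance, a common variation is to call wg.Add(1) right before starting each goroutine instead of hard-coding the total. The sketch below assumes a hypothetical `crawlAll` function and replaces the real fetching with a short sleep; the atomic counter exists only so the result can be checked.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
	"sync/atomic"
	"time"
)

// crawlAll runs one goroutine per URL and returns how many finished.
// It is a hypothetical helper; the sleep stands in for the real fetch.
func crawlAll(urls []string) int {
	var wg sync.WaitGroup
	var finished int64
	for _, u := range urls {
		wg.Add(1) // register the goroutine before starting it
		go func(url string) {
			defer wg.Done() // runs even if the body returns early
			// A real crawler would fetch and parse url here.
			time.Sleep(10 * time.Millisecond)
			atomic.AddInt64(&finished, 1)
		}(u)
	}
	wg.Wait() // blocks until every Done has been called
	return int(atomic.LoadInt64(&finished))
}

func main() {
	urls := make([]string, 0, 10)
	for i := 0; i < 10; i++ {
		urls = append(urls, "https://movie.douban.com/top250?start="+strconv.Itoa(25*i))
	}
	fmt.Println("finished:", crawlAll(urls))
}
```

Calling Add before the go statement matters: if Add ran inside the goroutine, Wait could return before the counter was ever incremented.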
Afterword
Well, that's all for this article.
The code address
The full code can be found at this address.