In the last blog, “Golang Simple Crawler Framework (1) — Project Introduction and Environment Preparation”, we introduced the Go development environment setup and gave an overview of the crawler project.
This crawler collects user information from Zhenai; the crawling steps are as follows:
- 1. Enter the Zhenai city page and crawl all the city information
- 2. Visit each city's detail page and crawl the user URLs
- 3. Visit each user's detail page and extract the required user information
Note: this crawler project only implements a simple crawler architecture, covering a single-task (standalone) version, a simple concurrent version, a concurrent version that uses queues for task scheduling, and data storage and presentation. It does not involve simulated login, dynamic IPs or other such techniques. If you are a Go novice looking for a practice project, or a reader interested in crawlers, feel free to dig in.
1. Single-task version crawler architecture
First we implement a single-task version of the crawler, leaving the data storage and presentation modules aside and getting the basic functionality working first. Below is the overall architecture of the single-task crawler.
The following is the specific process description:
- 1. First, the seed requests need to be configured; the seeds hold the crawler's initial entry points
- 2. The initial entries are sent to the crawler Engine, which puts them into the task queue as tasks and, as long as the queue is not empty, keeps taking tasks from it
- 3. After taking a task, the Engine hands it to the Fetcher module, which fetches the web page data for the task's URL and returns it to the Engine
- 4. The Parser module parses the required data out of that page and returns it to the Engine; the Engine receives the parsed items and prints them on the console
Project directory
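Judging from the file path comments in the code below (the profile parser's file name is an assumption), the project layout is roughly:

crawler/
├── main.go
├── engine/
│   ├── engine.go
│   └── types.go
├── fetcher/
│   └── fetcher.go
├── model/
│   └── profile.go
└── zhenai/
    └── parser/
        ├── citylist.go
        ├── city.go
        └── profile.go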
2. Data structure definition
Take a look at the data structure in the project before we begin.
// /engine/types.go
package engine
// Request structure
type Request struct {
	Url       string                   // request address
	ParseFunc func([]byte) ParseResult // parse function
}

// Parse result structure
type ParseResult struct {
	Requests []Request     // the parsed requests
	Items    []interface{} // the parsed content
}
A Request represents a crawl request, containing the URL to request and the parse function to use. The parse function returns a ParseResult, which contains the Requests parsed from the page and the extracted content; Items is of type []interface{}, meaning the concrete data structure is up to the user.
Note: for a given Request, whether to use the city-list parser or the user-list parser depends on our specific business. The Engine module does not need to know what the parse function actually does; it only needs to call the Request's ParseFunc on the web page data fetched from the Request's URL.
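As a small sketch of this (the city URL below is made up, and the parsers are the ones implemented in section 4), two Requests built with different parse functions go through exactly the same Engine code:

// Illustrative sketch only: the Engine never inspects which parser a Request carries.
seed := engine.Request{
	Url:       "http://www.zhenai.com/zhenghun",
	ParseFunc: parser.ParseCityList, // city-list parser
}
city := engine.Request{
	Url:       "http://www.zhenai.com/zhenghun/shanghai", // hypothetical city URL
	ParseFunc: parser.ParseCity, // user-list parser for one city
}
for _, r := range []engine.Request{seed, city} {
	content, err := fetcher.Fetch(r.Url)
	if err != nil {
		continue
	}
	_ = r.ParseFunc(content) // same call, whichever parser it happens to be
}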
The definition of the data to be crawled
// /model/profile.go
package model
// User's personal information
type Profile struct {
	Name     string
	Gender   string
	Age      int
	Height   int
	Weight   int
	Income   string
	Marriage string
	Address  string
}
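Since Items is []interface{}, whoever consumes a ParseResult needs a type assertion to get the concrete type back. The single-task version below does not need this (the Engine just prints each item), but as an illustrative sketch of how a later storage module might recover a Profile:

// Illustrative sketch only: recovering a model.Profile from the generic Items slice.
for _, item := range parseResult.Items {
	if profile, ok := item.(model.Profile); ok {
		log.Printf("parsed profile: %s, age %d", profile.Name, profile.Age)
	}
}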
3. Fetcher implementation
The task of the Fetcher module is to fetch the web page data for a target URL. Here is the code:
// /fetcher/fetcher.go
package fetcher
import (
"bufio"
"fmt"
"io/ioutil"
"log"
"net/http"
"golang.org/x/net/html/charset"
"golang.org/x/text/encoding"
"golang.org/x/text/encoding/unicode"
"golang.org/x/text/transform"
)
// Fetch downloads the web page content for the given URL
func Fetch(url string) ([]byte, error) {
	client := &http.Client{}
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		log.Fatalln(err)
	}
	req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36")
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// Error handling
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("wrong status code: %d", resp.StatusCode)
	}

	// Convert the page to UTF-8 encoding
	bodyReader := bufio.NewReader(resp.Body)
	e := determineEncoding(bodyReader)
	utf8Reader := transform.NewReader(bodyReader, e.NewDecoder())
	return ioutil.ReadAll(utf8Reader)
}

// determineEncoding sniffs the page encoding from the first bytes of the body
func determineEncoding(r *bufio.Reader) encoding.Encoding {
	bytes, err := r.Peek(1024)
	if err != nil {
		log.Printf("Fetcher error %v\n", err)
		return unicode.UTF8
	}
	e, _, _ := charset.DetermineEncoding(bytes, "")
	return e
}
Since many web pages are GBK-encoded, we need to convert the data to UTF-8. This requires an extra package: open a terminal and run gopm get -g -v golang.org/x/text to download it. The conversion part of the code above:
bodyReader := bufio.NewReader(resp.Body)
e := determineEncoding(bodyReader)
utf8Reader := transform.NewReader(bodyReader, e.NewDecoder())
could instead be written as utf8Reader := transform.NewReader(resp.Body, simplifiedchinese.GBK.NewDecoder()), which also works. The problem is poor generality: how do we know whether a given page is actually GBK-encoded? Here we can introduce another library that helps determine a page's encoding. Open a terminal and run gopm get -g -v golang.org/x/net/html, then extract the encoding detection into its own function, determineEncoding, as shown in the code above.
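For comparison, a minimal sketch of the hard-coded GBK variant mentioned above might look like this (it assumes the page really is GBK; simplifiedchinese comes from the same golang.org/x/text package, and fetchGBK is just an illustrative name):

// Illustrative sketch: a Fetch variant that assumes the page is GBK-encoded.
package fetcher

import (
	"io/ioutil"
	"net/http"

	"golang.org/x/text/encoding/simplifiedchinese"
	"golang.org/x/text/transform"
)

func fetchGBK(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	// Hard-code the GBK decoder instead of detecting the encoding.
	utf8Reader := transform.NewReader(resp.Body, simplifiedchinese.GBK.NewDecoder())
	return ioutil.ReadAll(utf8Reader)
}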
4. Parser module implementation
(1) Parse city list and URL:
// /zhenai/parser/citylist.go
package parser
import (
"crawler/engine"
"regexp"
)
const cityListRe = `<a href="(http://www.zhenai.com/zhenghun/[0-9a-z]+)"[^>]*>([^<]+)</a>`
// Parse the list of cities
func ParseCityList(bytes []byte) engine.ParseResult {
	re := regexp.MustCompile(cityListRe)
	// submatch is of type [][][]byte:
	// the first level has one entry per match, and the second level holds the full match followed by its capture groups
	submatch := re.FindAllSubmatch(bytes, -1)

	result := engine.ParseResult{}
	//limit := 10
	for _, item := range submatch {
		result.Items = append(result.Items, "City:"+string(item[2]))
		result.Requests = append(result.Requests, engine.Request{
			Url:       string(item[1]), // the URL for each city
			ParseFunc: ParseCity,       // use the city parser for it
		})
		//limit--
		//if limit == 0 {
		//	break
		//}
	}
	return result
}
In the code above, we extract all the cities and their URLs from the page, then use each city's URL as the URL of the next Request, with ParseCity as its parse function (the city parser).
When testing ParseCityList, if we only want to test the city-list parsing and do not want to involve ParseCity, we can define a function NilParseFun that returns an empty ParseResult and write ParseFunc: NilParseFun instead.
func NilParseFun([]byte) ParseResult {
	return ParseResult{}
}
Because the http://www.zhenai.com/zhenghun page lists a lot of cities, we can limit the number of cities parsed to make testing easier; that is what the commented-out limit code above does.
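A minimal test sketch along these lines (the test data file name is an assumption; save a copy of the city-list page locally first):

// /zhenai/parser/citylist_test.go (sketch)
package parser

import (
	"io/ioutil"
	"testing"
)

func TestParseCityList(t *testing.T) {
	// citylist_test_data.html is a locally saved copy of http://www.zhenai.com/zhenghun
	contents, err := ioutil.ReadFile("citylist_test_data.html")
	if err != nil {
		t.Fatal(err)
	}
	result := ParseCityList(contents)
	if len(result.Requests) == 0 || len(result.Items) == 0 {
		t.Errorf("expected some cities, got %d requests and %d items",
			len(result.Requests), len(result.Items))
	}
}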
Note: in the parser modules, exactly what information is parsed and how the regular expressions are written is not the focus of this article. The emphasis is on understanding the relationships and function calls between the various modules; the same applies below.
(2) Parse the user list and URL
// /zhenai/parser/city.go
package parser
import (
"crawler/engine"
"regexp"
)
var cityRe = regexp.MustCompile(`<a href="(http://album.zhenai.com/u/[0-9]+)"[^>]*>([^<]+)</a>`)
// The user gender regex. The user detail page has no gender information, so gender is taken from the user list page
var sexRe = regexp.MustCompile(`<td width="180"><span class="grayL">gender:</span>([^<]+)</td>`)

// City page user parser
func ParseCity(bytes []byte) engine.ParseResult {
	submatch := cityRe.FindAllSubmatch(bytes, -1)
	gendermatch := sexRe.FindAllSubmatch(bytes, -1)

	result := engine.ParseResult{}
	for k, item := range submatch {
		name := string(item[2])
		gender := string(gendermatch[k][1])
		result.Items = append(result.Items, "User:"+name)
		result.Requests = append(result.Requests, engine.Request{
			Url: string(item[1]),
			ParseFunc: func(bytes []byte) engine.ParseResult {
				return ParseProfile(bytes, name, gender)
			},
		})
	}
	return result
}
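The anonymous ParseFunc above is a closure that captures each user's name and gender so that ParseProfile (implemented next) receives them together with the page bytes. If you prefer a named helper, an equivalent way to write it would be something like:

// Hypothetical helper: returns a parse function bound to one user's name and gender.
func profileParser(name, gender string) func([]byte) engine.ParseResult {
	return func(bytes []byte) engine.ParseResult {
		return ParseProfile(bytes, name, gender)
	}
}

// ...and in ParseCity:
// ParseFunc: profileParser(name, gender),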
(3) Parse the user data
package parser
import (
"crawler/engine"
"crawler/model"
"regexp"
"strconv"
)
var ageRe = regexp.MustCompile(`<div class="m-btn purple" [^>]*>([\d]+)age</div>`)
var heightRe = regexp.MustCompile(`<div class="m-btn purple" [^>]*>([\d]+)cm</div>`)
var weightRe = regexp.MustCompile(`<div class="m-btn purple" [^>]*>([\d]+)kg</div>`)
var incomeRe = regexp.MustCompile(`<div class="m-btn purple" [^>]*>monthly income:([^<]+)</div>`)
var marriageRe = regexp.MustCompile(`<div class="m-btn purple" [^>]*>([^<]+)</div>`)
var addressRe = regexp.MustCompile(`<div class="m-btn purple" [^>]*>work:([^<]+)</div>`)
func ParseProfile(bytes []byte, name string, gender string) engine.ParseResult {
	profile := model.Profile{}
	profile.Name = name
	profile.Gender = gender

	if age, err := strconv.Atoi(extractString(bytes, ageRe)); err == nil {
		profile.Age = age
	}
	if height, err := strconv.Atoi(extractString(bytes, heightRe)); err == nil {
		profile.Height = height
	}
	if weight, err := strconv.Atoi(extractString(bytes, weightRe)); err == nil {
		profile.Weight = weight
	}
	profile.Income = extractString(bytes, incomeRe)
	profile.Marriage = extractString(bytes, marriageRe)
	profile.Address = extractString(bytes, addressRe)

	// No further requests are generated after the user information is parsed
	result := engine.ParseResult{
		Items: []interface{}{profile},
	}
	return result
}
func extractString(contents []byte, re *regexp.Regexp) string {
	submatch := re.FindSubmatch(contents)
	if len(submatch) >= 2 {
		return string(submatch[1])
	} else {
		return ""
	}
}
5. Engine implementation
The Engine module is the core of the whole system: it obtains web page data, parses it, and maintains the task queue.
// /engine/engine.go
package engine
import (
"crawler/fetcher"
"log"
)
// Run executes the crawl tasks
func Run(seeds ...Request) {
	// Create the task queue
	var requests []Request
	// Add the incoming seed tasks to the task queue
	for _, r := range seeds {
		requests = append(requests, r)
	}
	// Keep crawling as long as the queue is not empty
	for len(requests) > 0 {
		request := requests[0]
		requests = requests[1:]

		// Fetch the web page content
		log.Printf("Fetching %s\n", request.Url)
		content, err := fetcher.Fetch(request.Url)
		if err != nil {
			log.Printf("Fetch error, Url: %s %v\n", request.Url, err)
			continue
		}

		// Parse the page data with the parse function carried by the request
		parseResult := request.ParseFunc(content)
		// Add the newly parsed requests to the request queue
		requests = append(requests, parseResult.Requests...)

		// Print the parsed items
		for _, item := range parseResult.Items {
			log.Printf("Got item %v\n", item)
		}
	}
}
The Engine module is essentially the Run function. It receives one or more task requests and first adds them to the task queue. Then, as long as the queue is not empty, it takes a request from the queue, passes its URL to the Fetcher module to get the web page data, and parses that data with the request's parse function. The requests produced by parsing are appended to the queue, and the parsed items are printed out.
6. Main function
package main
import (
"crawler/engine"
"crawler/zhenai/parser"
)
func main() {
	engine.Run(engine.Request{ // the seed request
		Url:       "http://www.zhenai.com/zhenghun",
		ParseFunc: parser.ParseCityList,
	})
}
Call the Run method directly from the main function, passing in the initial request.
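With the log statements in the Engine, the program's output looks roughly like this (illustrative only; placeholders in angle brackets, log timestamps omitted, actual values depend on the live site):

Fetching http://www.zhenai.com/zhenghun
Got item City:<city name>
Got item City:<city name>
...
Fetching http://www.zhenai.com/zhenghun/<city>
Got item User:<user name>
...
Fetching http://album.zhenai.com/u/<user id>
Got item {<name> <gender> <age> <height> <weight> <income> <marriage> <address>}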
7. Summary
In this blog we used Go to implement a simple single-task (standalone) version of the crawler project. It focuses only on the core crawler architecture, without too much extra complexity; the key is to understand the Engine module and its call relationships with each parser module.
Its drawback is that the single-task version crawls too slowly and does not use Go's powerful concurrency features. In the next chapter, we will build on this project and refactor it into a concurrent crawler.
If you want in-depth Go videos from Google engineers, you can leave a comment in the comments section.
The source code of the project is hosted on GitHub, with a record for each version. You are welcome to check it out, and remember to give it a star. Thanks in advance.