In the last blog, “Golang Simple Crawler Framework (1) — Project Introduction and Environment Preparation”, we introduced the Go development environment setup and gave an overview of the crawler project.
This crawler collects user information from Zhenai; the crawling steps are as follows:
- 1. Enter the Zhenai city page and crawl all the city information
- 2. Visit each city's detail page and crawl the user URLs
- 3. Visit each user's detail page and extract the required user information
Note: this crawler project only implements a simple crawler architecture, covering a single-task (standalone) version, a simple concurrent version, a concurrent version that uses queues for task scheduling, and data storage and presentation. It does not involve simulated login, dynamic IPs or other such techniques. If you are a Go novice looking for a practice project, or a reader interested in crawlers, feel free to dig in.
1. Single-task version crawler architecture
First we implement a single-task version of the crawler, leaving the data storage and presentation modules aside and getting the basic functionality working first. Below is the overall architecture of the single-task crawler.
The following is the specific process description:
- 1. First, the seed requests need to be configured; the seeds hold the crawler's initial entry points
- 2. The initial entries are sent to the crawler Engine, which puts them into the task queue as tasks and, as long as the queue is not empty, keeps taking tasks from it
- 3. After taking a task, the Engine hands it to the Fetcher module, which fetches the web page data for the task's URL and returns it to the Engine
- 4. The Parser module parses the required data out of that page and returns it to the Engine; the Engine receives the parsed items and prints them on the console
Project directory
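Judging from the file path comments in the code below (the profile parser's file name is an assumption), the project layout is roughly:

crawler/
├── main.go
├── engine/
│   ├── engine.go
│   └── types.go
├── fetcher/
│   └── fetcher.go
├── model/
│   └── profile.go
└── zhenai/
    └── parser/
        ├── citylist.go
        ├── city.go
        └── profile.go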
2. Data structure definition
Take a look at the data structure in the project before we begin.
// /engine/types.go
package engine
// Request structure
type Request struct {
	Url       string                   // request address
	ParseFunc func([]byte) ParseResult // parse function
}

// Parse result structure
type ParseResult struct {
	Requests []Request     // the parsed requests
	Items    []interface{} // the parsed content
}
A Request represents a crawl request, containing the URL to request and the parse function to use. The parse function returns a ParseResult, which contains the Requests parsed from the page and the extracted content; Items is of type []interface{}, meaning the concrete data structure is up to the user.
Note: for a given Request, whether to use the city-list parser or the user-list parser depends on our specific business. The Engine module does not need to know what the parse function actually does; it only needs to call the Request's ParseFunc on the web page data fetched from the Request's URL.
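As a small sketch of this (the city URL below is made up, and the parsers are the ones implemented in section 4), two Requests built with different parse functions go through exactly the same Engine code:

// Illustrative sketch only: the Engine never inspects which parser a Request carries.
seed := engine.Request{
	Url:       "http://www.zhenai.com/zhenghun",
	ParseFunc: parser.ParseCityList, // city-list parser
}
city := engine.Request{
	Url:       "http://www.zhenai.com/zhenghun/shanghai", // hypothetical city URL
	ParseFunc: parser.ParseCity, // user-list parser for one city
}
for _, r := range []engine.Request{seed, city} {
	content, err := fetcher.Fetch(r.Url)
	if err != nil {
		continue
	}
	_ = r.ParseFunc(content) // same call, whichever parser it happens to be
}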
The definition of the data to be crawled
// /model/profile.go
package model
// User's personal information
type Profile struct {
	Name     string
	Gender   string
	Age      int
	Height   int
	Weight   int
	Income   string
	Marriage string
	Address  string
}
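Since Items is []interface{}, whoever consumes a ParseResult needs a type assertion to get the concrete type back. The single-task version below does not need this (the Engine just prints each item), but as an illustrative sketch of how a later storage module might recover a Profile:

// Illustrative sketch only: recovering a model.Profile from the generic Items slice.
for _, item := range parseResult.Items {
	if profile, ok := item.(model.Profile); ok {
		log.Printf("parsed profile: %s, age %d", profile.Name, profile.Age)
	}
}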
3. Fetcher implementation
The task of the Fetcher module is to fetch the web page data for a target URL. Here is the code:
// /fetcher/fetcher.go
package fetcher
import (
"bufio"
"fmt"
"io/ioutil"
"log"
"net/http"
"golang.org/x/net/html/charset"
"golang.org/x/text/encoding"
"golang.org/x/text/encoding/unicode"
"golang.org/x/text/transform"
)
// Fetch downloads the web page content for the given URL
func Fetch(url string) ([]byte, error) {
	client := &http.Client{}
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		log.Fatalln(err)
	}
	req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36")
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// Error handling
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("wrong status code: %d", resp.StatusCode)
	}

	// Convert the page to UTF-8 encoding
	bodyReader := bufio.NewReader(resp.Body)
	e := determineEncoding(bodyReader)
	utf8Reader := transform.NewReader(bodyReader, e.NewDecoder())
	return ioutil.ReadAll(utf8Reader)
}

// determineEncoding sniffs the page encoding from the first bytes of the body
func determineEncoding(r *bufio.Reader) encoding.Encoding {
	bytes, err := r.Peek(1024)
	if err != nil {
		log.Printf("Fetcher error %v\n", err)
		return unicode.UTF8
	}
	e, _, _ := charset.DetermineEncoding(bytes, "")
	return e
}
Since many web pages are GBK-encoded, we need to convert the data to UTF-8. This requires an extra package: open a terminal and run gopm get -g -v golang.org/x/text to download it. The conversion part of the code above:
bodyReader := bufio.NewReader(resp.Body)
e := determineEncoding(bodyReader)
utf8Reader := transform.NewReader(bodyReader, e.NewDecoder())
could instead be written as utf8Reader := transform.NewReader(resp.Body, simplifiedchinese.GBK.NewDecoder()), which also works. The problem is poor generality: how do we know whether a given page is actually GBK-encoded? Here we can introduce another library that helps determine a page's encoding. Open a terminal and run gopm get -g -v golang.org/x/net/html, then extract the encoding detection into its own function, determineEncoding, as shown in the code above.
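For comparison, a minimal sketch of the hard-coded GBK variant mentioned above might look like this (it assumes the page really is GBK; simplifiedchinese comes from the same golang.org/x/text package, and fetchGBK is just an illustrative name):

// Illustrative sketch: a Fetch variant that assumes the page is GBK-encoded.
package fetcher

import (
	"io/ioutil"
	"net/http"

	"golang.org/x/text/encoding/simplifiedchinese"
	"golang.org/x/text/transform"
)

func fetchGBK(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	// Hard-code the GBK decoder instead of detecting the encoding.
	utf8Reader := transform.NewReader(resp.Body, simplifiedchinese.GBK.NewDecoder())
	return ioutil.ReadAll(utf8Reader)
}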
4. Parser module implementation
(1) Parse city list and URL:
// /zhenai/parser/citylist.go
package parser
import (
"crawler/engine"
"regexp"
)
const cityListRe = `<a href="(http://www.zhenai.com/zhenghun/[0-9a-z]+)"[^>]*>([^<]+)</a>`
// Parse the list of cities
func ParseCityList(bytes []byte) engine.ParseResult {
	re := regexp.MustCompile(cityListRe)
	// submatch is of type [][][]byte:
	// the first level has one entry per match, and the second level holds the full match followed by its capture groups
	submatch := re.FindAllSubmatch(bytes, -1)

	result := engine.ParseResult{}
	//limit := 10
	for _, item := range submatch {
		result.Items = append(result.Items, "City:"+string(item[2]))
		result.Requests = append(result.Requests, engine.Request{
			Url:       string(item[1]), // the URL for each city
			ParseFunc: ParseCity,       // use the city parser for it
		})
		//limit--
		//if limit == 0 {
		//	break
		//}
	}
	return result
}
In the code above, we extract all the cities and their URLs from the page, then use each city's URL as the URL of the next Request, with ParseCity as its parse function (the city parser).
When testing ParseCityList, if we only want to test the city-list parsing and do not want to involve ParseCity, we can define a function NilParseFun that returns an empty ParseResult and write ParseFunc: NilParseFun instead.
func NilParseFun([]byte) ParseResult {
	return ParseResult{}
}
Because the http://www.zhenai.com/zhenghun page lists a lot of cities, we can limit the number of cities parsed to make testing easier; that is what the commented-out limit code above does.
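A minimal test sketch along these lines (the test data file name is an assumption; save a copy of the city-list page locally first):

// /zhenai/parser/citylist_test.go (sketch)
package parser

import (
	"io/ioutil"
	"testing"
)

func TestParseCityList(t *testing.T) {
	// citylist_test_data.html is a locally saved copy of http://www.zhenai.com/zhenghun
	contents, err := ioutil.ReadFile("citylist_test_data.html")
	if err != nil {
		t.Fatal(err)
	}
	result := ParseCityList(contents)
	if len(result.Requests) == 0 || len(result.Items) == 0 {
		t.Errorf("expected some cities, got %d requests and %d items",
			len(result.Requests), len(result.Items))
	}
}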
Note: in the parser modules, exactly what information is parsed and how the regular expressions are written is not the focus of this article. The emphasis is on understanding the relationships and function calls between the various modules; the same applies below.
(2) Parse the user list and URL
// /zhenai/parser/city.go
package parser
import (
"crawler/engine"
"regexp"
)
var cityRe = regexp.MustCompile(`<a href="(http://album.zhenai.com/u/[0-9]+)"[^>]*>([^<]+)</a>`)
// The user gender regex. The user detail page has no gender information, so gender is taken from the user list page
var sexRe = regexp.MustCompile(`<td width="180"><span class="grayL">gender:</span>([^<]+)</td>`)

// City page user parser
func ParseCity(bytes []byte) engine.ParseResult {
	submatch := cityRe.FindAllSubmatch(bytes, -1)
	gendermatch := sexRe.FindAllSubmatch(bytes, -1)

	result := engine.ParseResult{}
	for k, item := range submatch {
		name := string(item[2])
		gender := string(gendermatch[k][1])
		result.Items = append(result.Items, "User:"+name)
		result.Requests = append(result.Requests, engine.Request{
			Url: string(item[1]),
			ParseFunc: func(bytes []byte) engine.ParseResult {
				return ParseProfile(bytes, name, gender)
			},
		})
	}
	return result
}
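The anonymous ParseFunc above is a closure that captures each user's name and gender so that ParseProfile (implemented next) receives them together with the page bytes. If you prefer a named helper, an equivalent way to write it would be something like:

// Hypothetical helper: returns a parse function bound to one user's name and gender.
func profileParser(name, gender string) func([]byte) engine.ParseResult {
	return func(bytes []byte) engine.ParseResult {
		return ParseProfile(bytes, name, gender)
	}
}

// ...and in ParseCity:
// ParseFunc: profileParser(name, gender),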
(3) Parse the user data
package parser
import (
"crawler/engine"
"crawler/model"
"regexp"
"strconv"
)
var ageRe = regexp.MustCompile(`<div class="m-btn purple" [^>]*>([\d]+)age</div>`)
var heightRe = regexp.MustCompile(`<div class="m-btn purple" [^>]*>([\d]+)cm</div>`)
var weightRe = regexp.MustCompile(`<div class="m-btn purple" [^>]*>([\d]+)kg</div>`)
var incomeRe = regexp.MustCompile(`<div class="m-btn purple" [^>]*>monthly income:([^<]+)</div>`)
var marriageRe = regexp.MustCompile(`<div class="m-btn purple" [^>]*>([^<]+)</div>`)
var addressRe = regexp.MustCompile(`<div class="m-btn purple" [^>]*>work:([^<]+)</div>`)
func ParseProfile(bytes []byte, name string, gender string) engine.ParseResult {
	profile := model.Profile{}
	profile.Name = name
	profile.Gender = gender

	if age, err := strconv.Atoi(extractString(bytes, ageRe)); err == nil {
		profile.Age = age
	}
	if height, err := strconv.Atoi(extractString(bytes, heightRe)); err == nil {
		profile.Height = height
	}
	if weight, err := strconv.Atoi(extractString(bytes, weightRe)); err == nil {
		profile.Weight = weight
	}
	profile.Income = extractString(bytes, incomeRe)
	profile.Marriage = extractString(bytes, marriageRe)
	profile.Address = extractString(bytes, addressRe)

	// No further requests are generated after the user information is parsed
	result := engine.ParseResult{
		Items: []interface{}{profile},
	}
	return result
}
func extractString(contents []byte, re *regexp.Regexp) string {
	submatch := re.FindSubmatch(contents)
	if len(submatch) >= 2 {
		return string(submatch[1])
	} else {
		return ""
	}
}
5. Engine implementation
The Engine module is the core of the whole system: it obtains web page data, parses it, and maintains the task queue.
// /engine/engine.go
package engine
import (
"crawler/fetcher"
"log"
)
// Run executes the crawl tasks
func Run(seeds ...Request) {
	// Create the task queue
	var requests []Request
	// Add the incoming seed tasks to the task queue
	for _, r := range seeds {
		requests = append(requests, r)
	}
	// Keep crawling as long as the queue is not empty
	for len(requests) > 0 {
		request := requests[0]
		requests = requests[1:]

		// Fetch the web page content
		log.Printf("Fetching %s\n", request.Url)
		content, err := fetcher.Fetch(request.Url)
		if err != nil {
			log.Printf("Fetch error, Url: %s %v\n", request.Url, err)
			continue
		}

		// Parse the page data with the parse function carried by the request
		parseResult := request.ParseFunc(content)
		// Add the newly parsed requests to the request queue
		requests = append(requests, parseResult.Requests...)

		// Print the parsed items
		for _, item := range parseResult.Items {
			log.Printf("Got item %v\n", item)
		}
	}
}
The Engine module is essentially the Run function. It receives one or more task requests and first adds them to the task queue. Then, as long as the queue is not empty, it takes a request from the queue, passes its URL to the Fetcher module to get the web page data, and parses that data with the request's parse function. The requests produced by parsing are appended to the queue, and the parsed items are printed out.
6. Main function
package main
import (
"crawler/engine"
"crawler/zhenai/parser"
)
func main() {
	engine.Run(engine.Request{ // the seed request
		Url:       "http://www.zhenai.com/zhenghun",
		ParseFunc: parser.ParseCityList,
	})
}
Call the Run method directly from the main function, passing in the initial request.
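With the log statements in the Engine, the program's output looks roughly like this (illustrative only; placeholders in angle brackets, log timestamps omitted, actual values depend on the live site):

Fetching http://www.zhenai.com/zhenghun
Got item City:<city name>
Got item City:<city name>
...
Fetching http://www.zhenai.com/zhenghun/<city>
Got item User:<user name>
...
Fetching http://album.zhenai.com/u/<user id>
Got item {<name> <gender> <age> <height> <weight> <income> <marriage> <address>}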
7. Summary
In this blog we used Go to implement a simple single-task (standalone) version of the crawler project. It focuses only on the core crawler architecture, without too much extra complexity; the key is to understand the Engine module and its call relationships with each parser module.
Its drawback is that the single-task version crawls too slowly and does not use Go's powerful concurrency features. In the next chapter, we will build on this project and refactor it into a concurrent crawler.
If you want in-depth Go videos from Google engineers, you can leave a comment in the comments section.
The source code of the project is hosted on GitHub, with a record for each version. You are welcome to check it out, and remember to give it a star. Thanks in advance.