Preface

Python crawlers can start to feel routine, so why not try writing one in Golang? This article will continue to be updated!

Golang's standard library provides the net/http package, which natively supports building requests and handling responses.

1. Send the request

  • Construct the client
	var client http.Client
  • Construct a GET request
	req, err := http.NewRequest("GET", URL, nil)
  • Construct a POST request

Go provides the cookiejar.New function (in net/http/cookiejar), which creates a jar that retains the cookies a site sets. This matters for sites that can only be accessed after logging in: once you log in, the server issues a cookie that identifies the user, and that cookie tells the server who is making each later request. For example, to crawl a class schedule from a school's academic affairs system we must log in first, because every student's schedule is different and the server needs to know whose schedule to return. The crawler therefore has to send the cookie in its request headers.

	jar, err := cookiejar.New(nil) // nil options: no public suffix list
	if err != nil {
		panic(err)
	}
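
To make the role of the jar concrete, here is a minimal sketch, assuming the jar is attached to the client through its Jar field. The local test server, its /login and /schedule paths, and the cookie name session are all hypothetical stand-ins for a real site that requires login.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/cookiejar"
	"net/http/httptest"
)

// newTestServer is a hypothetical stand-in for a site that requires login:
// /login sets a session cookie and /schedule echoes the cookie it receives.
func newTestServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/login", func(w http.ResponseWriter, r *http.Request) {
		http.SetCookie(w, &http.Cookie{Name: "session", Value: "abc123", Path: "/"})
	})
	mux.HandleFunc("/schedule", func(w http.ResponseWriter, r *http.Request) {
		c, err := r.Cookie("session")
		if err != nil {
			http.Error(w, "not logged in", http.StatusUnauthorized)
			return
		}
		fmt.Fprint(w, "session="+c.Value)
	})
	return httptest.NewServer(mux)
}

// fetchWithJar logs in and then requests a protected page with the same
// client, so the jar replays the cookie automatically.
func fetchWithJar(baseURL string) (string, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return "", err
	}
	client := &http.Client{Jar: jar}

	resp, err := client.Get(baseURL + "/login")
	if err != nil {
		return "", err
	}
	resp.Body.Close()

	resp, err = client.Get(baseURL + "/schedule")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	srv := newTestServer()
	defer srv.Close()
	body, err := fetchWithJar(srv.URL)
	if err != nil {
		panic(err)
	}
	fmt.Println(body) // prints "session=abc123"
}
```

With the jar in place, the second request carries the cookie without any manual header handling.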

When constructing a POST request, encode the data to send as the request body and pass it to http.NewRequest together with the URL.

	client := &http.Client{Jar: jar}                   // attach the cookie jar created above
	info := "muser=" + muserid + "&passwd=" + password // form-encoded login fields
	data := strings.NewReader(info)
	req, err := http.NewRequest("POST", URL, data)
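
Concatenating the form string by hand breaks when a value contains characters such as & or a space. The sketch below is one possible alternative that lets url.Values do the escaping. The field names muser and passwd come from the snippet above; the helper names and the example.com URL are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// encodeLoginForm returns the form-encoded body, with special characters
// escaped and keys sorted alphabetically.
func encodeLoginForm(userID, password string) string {
	form := url.Values{}
	form.Set("muser", userID)
	form.Set("passwd", password)
	return form.Encode()
}

// newLoginRequest builds a POST request that carries the encoded form and
// the matching Content-Type header.
func newLoginRequest(target, userID, password string) (*http.Request, error) {
	body := strings.NewReader(encodeLoginForm(userID, password))
	req, err := http.NewRequest(http.MethodPost, target, body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	return req, nil
}

func main() {
	// The URL and credentials are placeholders for illustration only.
	req, err := newLoginRequest("http://example.com/login", "2021001", "p&ss word")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, encodeLoginForm("2021001", "p&ss word"))
	// prints "POST muser=2021001&passwd=p%26ss+word"
}
```
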
  • Add headers
	req.Header.Set("Connection", "keep-alive")
	req.Header.Set("Pragma", "no-cache")
	req.Header.Set("Cache-Control", "no-cache")
	req.Header.Set("Upgrade-Insecure-Requests", "1")
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36")
	req.Header.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9")
	req.Header.Set("Accept-Language", "zh-CN,zh;q=0.9")
  • Send the request
	resp, err := client.Do(req)                // send the request; check err before using resp
	bodyText, err := ioutil.ReadAll(resp.Body) // read the whole response body into memory
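
The two calls above leave out error handling and never close the response body. A minimal sketch of a fuller version follows, assuming Go 1.16 or later (where io.ReadAll replaces ioutil.ReadAll). The fetch helper, the 10-second timeout and the local test server are hypothetical choices made for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetch sends req with client and returns the body, failing on transport
// errors and on non-2xx status codes.
func fetch(client *http.Client, req *http.Request) ([]byte, error) {
	resp, err := client.Do(req)
	if err != nil {
		return nil, fmt.Errorf("send request: %w", err)
	}
	defer resp.Body.Close() // always release the connection

	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// fetchFromTestServer runs fetch against a throwaway local server so the
// sketch does not depend on any real website.
func fetchFromTestServer() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html><body>hello</body></html>")
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 10 * time.Second}
	req, err := http.NewRequest(http.MethodGet, srv.URL, nil)
	if err != nil {
		return "", err
	}
	body, err := fetch(client, req)
	return string(body), err
}

func main() {
	body, err := fetchFromTestServer()
	if err != nil {
		panic(err)
	}
	fmt.Println(body) // prints "<html><body>hello</body></html>"
}
```
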
  • About the cookie

As mentioned above, once the jar is attached to the client, cookies from responses are stored in the client.Jar field when requests are sent.

	myStr := fmt.Sprintf("%v", client.Jar) // format the jar as a string for inspection

After printing client.Jar, pick out the cookie returned in the response and set it on the request header. This is how I handle cookies for pages that require login.

	req.Header.Set("Cookie", "ASP.NET_SessionId="+cook) // cook holds the session ID taken from the jar
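
Instead of printing the jar and picking the value out of the string, the jar can be asked directly which cookies it would send to a given URL. The sketch below assumes that approach, using the Cookies method of http.CookieJar. The helper names, the example.com URL and the cookie value xyz789 are hypothetical; only the cookie name ASP.NET_SessionId comes from the snippet above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/cookiejar"
	"net/url"
)

// sessionCookie looks up a cookie by name among those the jar would send
// to rawURL. It reports false when no such cookie is stored.
func sessionCookie(jar http.CookieJar, rawURL, name string) (string, bool) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", false
	}
	for _, c := range jar.Cookies(u) {
		if c.Name == name {
			return c.Value, true
		}
	}
	return "", false
}

// demoJar fills a jar by hand, standing in for the cookies a real login
// response would have stored.
func demoJar() (http.CookieJar, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return nil, err
	}
	u, err := url.Parse("http://example.com/")
	if err != nil {
		return nil, err
	}
	jar.SetCookies(u, []*http.Cookie{{Name: "ASP.NET_SessionId", Value: "xyz789", Path: "/"}})
	return jar, nil
}

func main() {
	jar, err := demoJar()
	if err != nil {
		panic(err)
	}
	value, ok := sessionCookie(jar, "http://example.com/schedule", "ASP.NET_SessionId")
	fmt.Println(value, ok) // prints "xyz789 true"
}
```
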

That completes the request-sending part!

2. Parse the web page

2.1 CSS selectors

github.com/PuerkitoBio/goquery provides the NewDocumentFromReader method, which parses a web page into a document that can be queried with CSS selectors.

	doc, err := goquery.NewDocumentFromReader(resp.Body)

2.2 Xpath syntax

github.com/antchfx/htmlquery provides the Parse method, which parses a web page into a node tree that can be queried with XPath.

	root, _ := htmlquery.Parse(resp.Body)

2.3 Regular expressions

	reId := regexp.MustCompile(`id=(\d+)`) // matches "id=" followed by digits
	allId := reId.FindAll(bodyText, 1)     // at most one match, returned as [][]byte
	for _, item := range allId {
		id = string(item) // the full match, for example "id=123"
	}
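
Note that FindAll returns the whole match, including the "id=" prefix. If only the digits are wanted, one option is FindAllSubmatch, which also exposes the capture group. The following is a small sketch of that variant; the extractIDs helper and the sample HTML are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// idPattern captures the digits that follow "id=".
var idPattern = regexp.MustCompile(`id=(\d+)`)

// extractIDs returns the captured digits of every id=<digits> match in body.
func extractIDs(body []byte) []string {
	matches := idPattern.FindAllSubmatch(body, -1) // -1 means no limit
	ids := make([]string, 0, len(matches))
	for _, m := range matches {
		ids = append(ids, string(m[1])) // m[0] is the full match, m[1] the group
	}
	return ids
}

func main() {
	page := []byte(`<a href="/news?id=101">A</a> <a href="/news?id=202">B</a>`)
	fmt.Println(extractIDs(page)) // prints "[101 202]"
}
```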

3. Obtain node information

3.1 CSS selectors

With the doc parsed in 2.1, we can select nodes using CSS selector syntax.

	doc.Find("#main > div.right > div.detail_main_content").
		Each(func(i int, s *goquery.Selection) {
			Data.title = s.Find("p").Text()
			Data.time = s.Find("#fbsj").Text()
			Data.author = s.Find("#author").Text()
			Data.count = Read_Count(Read_Id)
			fmt.Println(Data.title, Data.time, Data.author, Data.count)
		})

	doc.Find("#news_content_display").Each(func(i int, s *goquery.Selection) {
		Data.content = s.Find("p").Text()
		fmt.Println(Data.content)
	})

3.2 Xpath syntax

With the root parsed in 2.2, we can select nodes using XPath syntax.

	tr := htmlquery.Find(root, "//*[@id='LB_kb']/table/tbody/tr/td") // use XPath to get the table cells
	for _, row := range tr { // len(tr) == 13 on the author's page
		classNames := htmlquery.Find(row, "./font")
		classPosistions := htmlquery.Find(row, "./text()[4]")
		classTeachers := htmlquery.Find(row, "./text()[5]")
		// Skip cells that lack any of the three parts to avoid indexing an empty slice.
		if len(classNames) != 0 && len(classPosistions) != 0 && len(classTeachers) != 0 {
			className = htmlquery.InnerText(classNames[0])
			classPosistion = htmlquery.InnerText(classPosistions[0])
			classTeacher = htmlquery.InnerText(classTeachers[0])
			fmt.Println(className)
			fmt.Println(classPosistion)
			fmt.Println(classTeacher)
		}
	}

4. Save the information

4.1 Use native SQL statements to save data in MySQL

  • Define the database connection parameters
const (
	usernameClass = "root"
	passwordClass = "root"
	ipClass       = "127.0.0.1"
	portClass     = "3306"
	dbnameClass   = "class"
)
  • Connect to the database
var DB *sql.DB

func InitDB() {
	// DSN layout: user:password@tcp(host:port)/dbname?charset=utf8
	path := strings.Join([]string{usernameClass, ":", passwordClass, "@tcp(", ipClass, ":", portClass, ")/", dbnameClass, "?charset=utf8"}, "")
	DB, _ = sql.Open("mysql", path)
	DB.SetConnMaxLifetime(10 * time.Second) // a bare 10 would mean 10 nanoseconds
	DB.SetMaxIdleConns(5)
	if err := DB.Ping(); err != nil {
		fmt.Println("open database fail")
		return
	}
	fmt.Println("connect success")
}
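
The connection string is easy to get wrong when it is joined piece by piece. The sketch below shows one way to build it in a single format string, assuming the commonly used github.com/go-sql-driver/mysql driver and its user:password@tcp(host:port)/dbname layout. The buildDSN helper is hypothetical; the sample values are the constants defined above.

```go
package main

import "fmt"

// buildDSN assembles a MySQL data source name in the form
// user:password@tcp(host:port)/dbname?charset=utf8.
func buildDSN(user, password, host, port, dbName string) string {
	return fmt.Sprintf("%s:%s@tcp(%s:%s)/%s?charset=utf8", user, password, host, port, dbName)
}

func main() {
	fmt.Println(buildDSN("root", "root", "127.0.0.1", "3306", "class"))
	// prints "root:root@tcp(127.0.0.1:3306)/class?charset=utf8"
}
```
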
  • Define the data type
type Class struct {
	classData   string
	teacherName string
	position    string
}
  • Insert data
func InsertData(Data Class) bool {
	tx, err := DB.Begin() // start a transaction
	if err != nil {
		fmt.Println("tx fail")
		return false
	}
	stmt, err := tx.Prepare("INSERT INTO class_data (`class`,`teacher`,`position`) VALUES (?, ?, ?)")
	if err != nil {
		fmt.Println("Prepare fail", err)
		_ = tx.Rollback() // undo the transaction on failure
		return false
	}
	defer stmt.Close()
	_, err = stmt.Exec(Data.classData, Data.teacherName, Data.position) // run the insert inside the transaction
	if err != nil {
		fmt.Println("Exec fail", err)
		_ = tx.Rollback()
		return false
	}
	_ = tx.Commit() // commit the transaction
	return true
}

4.2 Use GORM to save data in MySQL

  • Define the GORM model
type NewD struct {
	gorm.Model
	Title   string `gorm:"type:varchar(255);not null"`
	Time    string `gorm:"type:varchar(256);not null"`
	Author  string `gorm:"type:varchar(256);not null"`
	Count   string `gorm:"type:varchar(256);not null"`
	Content string `gorm:"type:longtext;not null"`
}
  • Connect to the database
var db *gorm.DB

func Init() {
	var err error
	// DSN layout: user:password@tcp(host:port)/dbname?charset=utf8
	path := strings.Join([]string{userName_New, ":", password_New, "@tcp(", ip_New, ":", port_New, ")/", dbName_New, "?charset=utf8"}, "")
	db, err = gorm.Open("mysql", path)
	if err != nil {
		panic(err)
	}
	fmt.Println("SUCCESS")
	_ = db.AutoMigrate(&NewD{}) // create or update the table for NewD
	sqlDB := db.DB()
	sqlDB.SetMaxIdleConns(10)
	sqlDB.SetMaxOpenConns(100)
}
  • Write data
	NewA := NewD{
		Title:   Data.title,
		Time:    Data.time,
		Author:  Data.author,
		Count:   Data.count,
		Content: Data.content,
	}
	err = db.Create(&NewA).Error // insert the record into the database