Preface
If writing crawlers in Python has started to feel routine, try writing one in Go. This article will be updated over time.
Go's standard library includes the net/http package, which natively supports building requests and handling responses.
1. Send the request
- Constructing the client
var client http.Client
- Construct a GET request:
reqList, err := http.NewRequest("GET", URL, nil)
- Constructing a POST request
Go provides the cookiejar.New function (in net/http/cookiejar), which creates a cookie jar that retains the cookies a server sets. This matters for websites that can only be accessed after logging in: a successful login returns a cookie that identifies the user, and that cookie tells the server who is making each later request. For example, to crawl a class schedule we first have to log in to the school's academic affairs system, because every student's schedule is different and the server needs to know whose schedule to return. The crawler therefore has to send the cookie in the request headers.
jar, err := cookiejar.New(nil)
if err != nil {
    panic(err)
}
client.Jar = jar // attach the jar to the client so that cookies are actually stored
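To make the login scenario concrete, here is a minimal sketch of a client with a cookie jar attached. The local test server, its /login and /schedule paths, and the cookie name "session" are all assumptions made for illustration; a real site will use its own URLs and cookie names.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/cookiejar"
	"net/http/httptest"
)

// newLoginServer starts a local stand-in for a site that requires login.
// /login sets a session cookie; /schedule echoes the cookie value it receives.
func newLoginServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/login", func(w http.ResponseWriter, r *http.Request) {
		http.SetCookie(w, &http.Cookie{Name: "session", Value: "abc123", Path: "/"})
	})
	mux.HandleFunc("/schedule", func(w http.ResponseWriter, r *http.Request) {
		c, err := r.Cookie("session")
		if err != nil {
			http.Error(w, "not logged in", http.StatusUnauthorized)
			return
		}
		fmt.Fprint(w, "schedule for "+c.Value)
	})
	return httptest.NewServer(mux)
}

// fetchWithJar logs in and then requests a protected page with the same client.
func fetchWithJar(baseURL string) (string, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return "", err
	}
	client := &http.Client{Jar: jar} // the jar only takes effect once attached

	resp, err := client.Get(baseURL + "/login")
	if err != nil {
		return "", err
	}
	resp.Body.Close()

	// The jar re-sends the session cookie automatically on this request.
	resp, err = client.Get(baseURL + "/schedule")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	srv := newLoginServer()
	defer srv.Close()
	body, err := fetchWithJar(srv.URL)
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```

With the jar attached, the second request carries the cookie without any manual header handling, which is usually simpler than copying the cookie by hand.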
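A POST request carries its data in the request body. Encode the form fields as a string and pass them to http.NewRequest together with the URL. The client constructed earlier is reused to send it.
info := "muser=" + muserid + "&passwd=" + password // values containing special characters should be escaped with url.QueryEscape
data := strings.NewReader(info)
req, err := http.NewRequest("POST", URL, data)
if err != nil {
    panic(err)
}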
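String concatenation breaks as soon as a value contains a character such as & or =. A minimal sketch of the same form built with url.Values from the standard library follows; the field names muser and passwd come from the snippet above, while the helper names, the target URL, and the sample credentials are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// encodeLoginForm percent-encodes the two form fields.
// Encode sorts the keys, so the output order is deterministic.
func encodeLoginForm(userID, password string) string {
	form := url.Values{}
	form.Set("muser", userID)
	form.Set("passwd", password)
	return form.Encode()
}

// newLoginRequest builds a form-encoded POST request for the given URL.
func newLoginRequest(target, userID, password string) (*http.Request, error) {
	body := strings.NewReader(encodeLoginForm(userID, password))
	req, err := http.NewRequest(http.MethodPost, target, body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	return req, nil
}

func main() {
	// Hypothetical credentials; note the "&" and "=" inside the password.
	fmt.Println(encodeLoginForm("20210001", "p&ss=word"))
	// prints: muser=20210001&passwd=p%26ss%3Dword

	req, err := newLoginRequest("http://example.com/login", "20210001", "p&ss=word")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.Header.Get("Content-Type"))
}
```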
- Add headers
req.Header.Set("Connection", "keep-alive")
req.Header.Set("Pragma", "no-cache")
req.Header.Set("Cache-Control", "no-cache")
req.Header.Set("Upgrade-Insecure-Requests", "1")
req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36")
req.Header.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9")
req.Header.Set("Accept-Language", "zh-CN,zh;q=0.9")
- Send the request
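resp, _ := client.Do(req) // send the request (errors are ignored here for brevity; check them in real code)
defer resp.Body.Close() // close the body once it has been read
bodyText, _ := ioutil.ReadAll(resp.Body) // read the entire response body into memory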
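The snippet above ignores errors to stay short. A more defensive version might look like the sketch below, which checks every error, sets a timeout, and rejects non-200 responses. The function names fetch and newEchoServer, the 10-second timeout, and the shortened User-Agent string are illustrative assumptions; io.ReadAll requires Go 1.16 or later.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// client is shared by all requests; the timeout guards against hung servers.
var client = &http.Client{Timeout: 10 * time.Second}

// fetch sends a GET request with a browser-like User-Agent and returns the body.
func fetch(target string) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", "Mozilla/5.0 (example crawler)")

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close() // always release the connection

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// newEchoServer is a local test server that echoes the User-Agent it received.
func newEchoServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/" {
			http.NotFound(w, r)
			return
		}
		fmt.Fprint(w, "UA: "+r.Header.Get("User-Agent"))
	}))
}

func main() {
	srv := newEchoServer()
	defer srv.Close()

	body, err := fetch(srv.URL + "/")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))

	// A missing page is reported as an error instead of being parsed as content.
	_, err = fetch(srv.URL + "/missing")
	fmt.Println(err)
}
```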
- About the cookie
As mentioned above, once a cookie jar is attached to the client, the cookies returned by the server are stored in client.Jar when the request is sent.
myStr := fmt.Sprintf("%v", client.Jar) // dump the jar's contents as a string for inspection
After printing client.Jar, pick the session cookie out of the stored cookies and set it on the request header of later requests. This is enough to handle sites that require login.
req.Header.Set("Cookie", "ASP.NET_SessionId="+cook)
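Instead of printing the jar and copying the value by hand, the cookie can be looked up by name through the jar's Cookies method. The following is a minimal sketch; the helper names cookieValue and demoJar, the example.com URLs, and the cookie value are assumptions made for illustration, and demoJar merely simulates what a login response would store.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/cookiejar"
	"net/url"
)

// cookieValue returns the value of the named cookie that the jar would send to rawURL.
func cookieValue(jar http.CookieJar, rawURL, name string) (string, bool) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", false
	}
	for _, c := range jar.Cookies(u) {
		if c.Name == name {
			return c.Value, true
		}
	}
	return "", false
}

// demoJar returns a jar pre-filled the way a login response would fill it.
func demoJar(rawURL string) http.CookieJar {
	jar, err := cookiejar.New(nil)
	if err != nil {
		panic(err)
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		panic(err)
	}
	jar.SetCookies(u, []*http.Cookie{{Name: "ASP.NET_SessionId", Value: "xyz789"}})
	return jar
}

func main() {
	jar := demoJar("http://example.com/")
	if cook, ok := cookieValue(jar, "http://example.com/kb", "ASP.NET_SessionId"); ok {
		fmt.Println("ASP.NET_SessionId=" + cook)
	}
}
```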
At this point, the sending of the request part is complete!
2. Parse the web page
2.1 CSS selectors
The github.com/PuerkitoBio/goquery package provides the NewDocumentFromReader function, which parses a web page into a document that can be queried with CSS selectors.
doc, err := goquery.NewDocumentFromReader(resp.Body)
2.2 XPath syntax
The github.com/antchfx/htmlquery package provides the Parse function, which parses a web page into a node tree that can be queried with XPath.
root, err := htmlquery.Parse(resp.Body)
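2.3 Regular expressions
reId := regexp.MustCompile(`id=(\d+)`) // compile the pattern once; the group captures the digits after "id="
allId := reId.FindAll(bodyText, 1) // n = 1 returns at most one match; pass -1 to get all matches
for _, item := range allId {
    id = string(item) // the whole match, for example "id=123"
}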
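FindAll returns the whole match, including the "id=" prefix. If only the digits are wanted, the capture group can be read with FindAllSubmatch. Here is a small sketch using the same pattern; the function name extractIDs and the sample HTML are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// idPattern is compiled once at package level and reused.
var idPattern = regexp.MustCompile(`id=(\d+)`)

// extractIDs returns every number captured by idPattern in body.
func extractIDs(body []byte) []string {
	var ids []string
	for _, m := range idPattern.FindAllSubmatch(body, -1) {
		// m[0] is the whole match ("id=101"), m[1] is the capture group ("101").
		ids = append(ids, string(m[1]))
	}
	return ids
}

func main() {
	page := []byte(`<a href="/news?id=101">A</a> <a href="/news?id=202">B</a>`)
	fmt.Println(extractIDs(page)) // prints: [101 202]
}
```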
3. Obtain node information
3.1 CSS selectors
After parsing the page into doc as in 2.1, we can select nodes with CSS selector syntax.
doc.Find("#main > div.right > div.detail_main_content").Each(func(i int, s *goquery.Selection) {
    Data.title = s.Find("p").Text()
    Data.time = s.Find("#fbsj").Text()
    Data.author = s.Find("#author").Text()
    Data.count = Read_Count(Read_Id)
    fmt.Println(Data.title, Data.time, Data.author, Data.count)
})
doc.Find("#news_content_display").Each(func(i int, s *goquery.Selection) {
    Data.content = s.Find("p").Text()
    fmt.Println(Data.content)
})
3.2 XPath syntax
After parsing the page into root as in 2.2, we can select nodes with XPath expressions.
tr := htmlquery.Find(root, "//*[@id='LB_kb']/table/tbody/tr/td") // select the table cells with XPath
for _, row := range tr { // len(tr) == 13 for this page
    classNames := htmlquery.Find(row, "./font")
    classPositions := htmlquery.Find(row, "./text()[4]")
    classTeachers := htmlquery.Find(row, "./text()[5]")
    // skip cells where any of the three nodes is missing, to avoid indexing an empty slice
    if len(classNames) != 0 && len(classPositions) != 0 && len(classTeachers) != 0 {
        className = htmlquery.InnerText(classNames[0])
        classPosition = htmlquery.InnerText(classPositions[0])
        classTeacher = htmlquery.InnerText(classTeachers[0])
        fmt.Println(className)
        fmt.Println(classPosition)
        fmt.Println(classTeacher)
    }
}
4. Save the information
4.1 Use native SQL statements to save data in MySQL
- Define the database connection parameters
const (
    usernameClass = "root"
    passwordClass = "root"
    ipClass       = "127.0.0.1"
    portClass     = "3306"
    dbnameClass   = "class"
)
- Connecting to a Database
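var DB *sql.DB
func InitDB() {
    // DSN format: user:password@tcp(host:port)/dbname?charset=utf8
    path := strings.Join([]string{usernameClass, ":", passwordClass, "@tcp(", ipClass, ":", portClass, ")/", dbnameClass, "?charset=utf8"}, "")
    var err error
    DB, err = sql.Open("mysql", path) // may only validate the arguments; Ping below opens a real connection
    if err != nil {
        fmt.Println("open database fail", err)
        return
    }
    DB.SetConnMaxLifetime(10 * time.Second) // a bare 10 would mean 10 nanoseconds; 10 seconds is assumed here
    DB.SetMaxIdleConns(5)
    if err := DB.Ping(); err != nil {
        fmt.Println("open database fail", err)
        return
    }
    fmt.Println("connect success")
}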
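The data source name is easy to get wrong when it is assembled piece by piece, because the closing parenthesis after the port is easy to drop. A small sketch that builds the same string with fmt.Sprintf is shown below; the helper name mysqlDSN is hypothetical, and the layout follows the DSN format documented by the go-sql-driver/mysql driver.

```go
package main

import "fmt"

// mysqlDSN formats a data source name for the go-sql-driver/mysql driver:
// user:password@tcp(host:port)/dbname?charset=utf8
func mysqlDSN(user, password, host, port, dbname string) string {
	return fmt.Sprintf("%s:%s@tcp(%s:%s)/%s?charset=utf8", user, password, host, port, dbname)
}

func main() {
	fmt.Println(mysqlDSN("root", "root", "127.0.0.1", "3306", "class"))
	// prints: root:root@tcp(127.0.0.1:3306)/class?charset=utf8
}
```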
- Defining data types
type Class struct {
    classData   string
    teacherName string
    position    string
}
- Insert data
func InsertData(Data Class) bool {
    tx, err := DB.Begin() // start a transaction
    if err != nil {
        fmt.Println("tx fail")
        return false
    }
    stmt, err := tx.Prepare("INSERT INTO class_data (`class`,`teacher`,`position`) VALUES (?, ?, ?)") // prepare the insert statement
    if err != nil {
        fmt.Println("Prepare fail", err)
        _ = tx.Rollback()
        return false
    }
    defer stmt.Close()
    _, err = stmt.Exec(Data.classData, Data.teacherName, Data.position) // insert the row
    if err != nil {
        fmt.Println("Exec fail", err)
        _ = tx.Rollback() // undo the transaction on failure
        return false
    }
    if err = tx.Commit(); err != nil { // commit the transaction
        fmt.Println("Commit fail", err)
        return false
    }
    return true
}
4.2 Using GORM to save data to MySQL
- Define the GORM model
type NewD struct {
    gorm.Model
    Title   string `gorm:"type:varchar(255);not null"`
    Time    string `gorm:"type:varchar(256);not null"`
    Author  string `gorm:"type:varchar(256);not null"`
    Count   string `gorm:"type:varchar(256);not null"`
    Content string `gorm:"type:longtext;not null"`
}
- Connecting to a Database
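var db *gorm.DB
func Init() {
    var err error
    // DSN format: user:password@tcp(host:port)/dbname?charset=utf8
    path := strings.Join([]string{userName_New, ":", password_New, "@tcp(", ip_New, ":", port_New, ")/", dbName_New, "?charset=utf8"}, "")
    db, err = gorm.Open("mysql", path) // this call signature is the GORM v1 (github.com/jinzhu/gorm) one
    if err != nil {
        panic(err)
    }
    fmt.Println("SUCCESS")
    _ = db.AutoMigrate(&NewD{}) // create or update the table to match the model
    sqlDB := db.DB() // the underlying *sql.DB connection pool
    sqlDB.SetMaxIdleConns(10)
    sqlDB.SetMaxOpenConns(100)
}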
- Write data
NewA := NewD{
    Title:   Data.title,
    Time:    Data.time,
    Author:  Data.author,
    Count:   Data.count,
    Content: Data.content,
}
err = db.Create(&NewA).Error // create a data item in the database