0x0 Tips before reading
Prerequisites: basic Golang syntax, basic knowledge of HTML, CSS, and JS, and at least a passing familiarity with regular expressions and Go's HTTP package.
The purpose of this article is to record the development of a minimalist crawler script from scratch. It is for learning use only; do not use it in a way that harms any site.
0x1 First introduction to a crawler
From Wikipedia:
A web crawler (spider) is a kind of bot used to browse the World Wide Web automatically, typically to build a web index. For example, web search engines use crawler software to keep their own content and their indexes of other sites up to date. Crawlers can save the pages they visit so that search engines can generate indexes for users to search.
Crawlers consume resources on the target systems they visit, and many systems do not welcome crawler traffic by default. A crawler that visits a large number of pages therefore needs to consider scheduling, load, and "politeness." Public sites that do not want to be crawled can declare this through mechanisms such as a robots.txt file, which can ask bots to index only part of the site, or nothing at all.
My current understanding is that a crawler can fetch web pages, but what it gets back is not necessarily the same as what the browser renders (the browser also loads additional resources). Most web sites have basic anti-crawling measures, which can often be bypassed by setting request headers.
Today's takeaway: worked through the overall development steps of a minimalist crawler.
TODO:
- Learn about golang’s HTTP package
- Hand-write a general crawler framework and modularize the whole project
- Go deeper into crawlers…
0x2 Minimalist crawler development steps
- Create a Client
- Create an HTTP request
- Add a Header to the request
- The client sends the request
- The client receives the response
- Decode Response.Body; if it is not UTF-8, convert it to UTF-8
- Parse the retrieved content: extract the required information with regular expressions and format the output
- Output the results
As shown in the figure below
0x03 Code + commentary
As follows:
```go
package main

import (
	"bufio"
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"regexp"

	"golang.org/x/net/html/charset"
	"golang.org/x/text/encoding"
	"golang.org/x/text/transform"
)

func main() {
	url := "https://www.bilibili.com/v/popular/rank/all"
	// Create a client
	client := &http.Client{}
	// Create a request
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		log.Fatalln(err)
	}
	// Set the Header
	req.Header.Set("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36")
	// The client sends the request and receives a response
	resp, err := client.Do(req)
	if err != nil {
		log.Fatalln(err)
	}
	defer resp.Body.Close()
	// If the access failed, print the current status code
	if resp.StatusCode != http.StatusOK {
		fmt.Println("Error: status code", resp.StatusCode)
		return
	}
	// Wrap the body once so the bytes peeked during encoding
	// detection are not lost before the UTF-8 conversion
	bodyReader := bufio.NewReader(resp.Body)
	// Detect the encoding of the retrieved content
	e := determineEncoding(bodyReader)
	// Wrap the body in a reader that converts it to UTF-8
	utf8Reader := transform.NewReader(bodyReader, e.NewDecoder())
	// Read the whole page
	all, err := ioutil.ReadAll(utf8Reader)
	if err != nil {
		panic(err)
	}
	// Print the page
	fmt.Printf("%s", all)
	// Parse and print the retrieved content
	printTitle(all)
}

func determineEncoding(r *bufio.Reader) encoding.Encoding {
	// Peek looks at the first 1024 bytes without consuming them
	bytes, err := r.Peek(1024)
	if err != nil {
		panic(err)
	}
	e, _, _ := charset.DetermineEncoding(bytes, "")
	return e
}

func printTitle(contents []byte) {
	// Regular expression: + means one or more, [^<] means any character
	// except '<', and each parenthesized group is captured as a submatch
	re := regexp.MustCompile(`<a href="//(www.bilibili.com/video/[0-9a-zA-Z]+)" target="_blank" class="title">([^<]+)</a>`)
	// -1 means find all matching strings
	matches := re.FindAllSubmatch(contents, -1)
	// Print every match
	for _, m := range matches {
		fmt.Printf("Title: %s, URL: %s\n", m[2], m[1])
	}
	fmt.Printf("matches: %d\n", len(matches))
}
```
0x04 Epilogue
I had no inspiration and no patience today, so this took two and a half hours to write. I'll fill in more when I have the energy.