Impressions of Go

One day in 2018, I asked the backend architect at my company which backend language I should learn besides Java, and he told me he was learning Go. He then went on about distributed systems, coroutines, big data, crawlers... a string of concepts I did not really understand. So I said I had better learn Node.js instead.

The reason I dare to take on Go again is that Leong (see the last essay) gave me the courage. If you are already an experienced backend developer, skip straight to the crawler section in the second half.

The topics of this article are Go, crawlers, and Juejin columns. The goal is to use Go to write a small tool that crawls Juejin column articles so they can be read slowly offline.

Let’s Go

Advantages

  • Simple syntax, easy to pick up (only 25 reserved keywords)
  • High performance, fast compilation, and development efficiency comparable to Python and Ruby
  • Easy to deploy: the compiled package is small and has almost no dependencies (the binary runs directly), much like Deno
  • Native support for concurrency (goroutines)
  • Official, unified tooling and conventions (gofmt, golint); the shadow of Deno shows up again
  • Rich standard library; once more, the shadow of Deno

Trend

The authoritative trend data has already been covered by the experts, so I will just add the trend of GitHub stars:

Introduction to Go

Go is a statically and strongly typed, compiled, concurrent, garbage-collected programming language developed by Google. It is sometimes referred to as Golang to make it easier to search for and identify.

Features of Go language

  1. Go is a relatively new, statically typed language with concurrency support, garbage collection, and fast compilation.
  2. Go provides built-in support for concurrent execution and communication, which makes it a natural fit for high-performance services.
  3. Go combines the ease of an interpreted language, the efficiency of a dynamically typed language, and the safety of a statically typed language.
  4. Go takes only a few seconds to compile a large program and is very easy to deploy.
  5. Go offers development efficiency close to Python/Ruby, with performance approaching C (although a gap remains).
  6. Go is easy to pick up (only 25 reserved keywords).
  7. Go has its own development conventions and tooling support.

Go Installation and Configuration

I also maintain a write-up called "a programmer's Mac development environment [continuously updated]", which records the development environment on my Mac. Readers are welcome to give it a star.

$ brew install go

Tip 1: Press CTRL + C to skip "Updating Homebrew…", otherwise the wait will make you question your life choices.

Tip 2: If you have time to wait, add the --verbose parameter so that the progress of the update is shown.

Tip 3: Homebrew works by syncing GitHub repositories, so if it is really slow for you, switch to a Homebrew mirror.

After the installation, check the Go version:

$ go version
go version go1.14.7 darwin/amd64

Configure environment variables:

$ open /usr/local/Cellar/go/

Find the libexec directory in there and note its full path. Mine is /usr/local/Cellar/go/1.14.7/libexec

Open ~/.zshrc (for example with nano ~/.zshrc) and add the following:

#GO
export GOROOT=/usr/local/Cellar/go/1.14.7/libexec
export GOPATH=~/.go
export PATH=${PATH}:$GOPATH/bin

Run source ~/.zshrc for the changes to take effect, then run go env to check whether the configuration succeeded.

The output is long:

GO111MODULE="on"
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/yangjunning/Library/Caches/go-build"
GOENV="/Users/yangjunning/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/yangjunning/go"
GOPRIVATE=""
GOPROXY="https://goproxy.cn,direct"
GOROOT="/usr/local/Cellar/go/1.14.7/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.14.7/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/zn/17xnqr8s1pjbpzt9_t38tyhc0000gn/T/go-build998676802=/tmp/go-build -gno-record-gcc-switches -fno-common"

Qiniu Cloud mirror proxy

Open your terminal and run the commands below. They work on Go 1.13 or above; for other versions, please read the Goproxy China documentation.

$ go env -w GO111MODULE=on
$ go env -w GOPROXY=https://goproxy.cn,direct

Common Go commands

1. go build: compiles the specified source files or packages together with their dependencies

2. go clean: removes the files generated by compiling the current source package

3. go doc: prints the documentation attached to a Go program entity. Pass the entity's identifier as an argument to view its documentation.

4. go fmt: formats your code files. Just run go fmt xxx.go and your code is rewritten in the standard format

5. go get: downloads or updates the specified packages and their dependencies from the Internet as needed, then compiles and installs them

6. go install: compiles and installs the specified packages and their dependencies

7. go run: compiles and runs the command source file

The Go standard library

1. sync: provides basic synchronization primitives. When multiple goroutines access a shared resource, they need the locking mechanisms provided by sync.

2. os: provides platform-independent access to operating system functionality through a Unix-style interface, covering file operations, process management, signals, and user accounts.

3. time: time-related processing

4. fmt: implements formatted input and output.

5. io: provides a set of platform-independent I/O interfaces and implementations, for example wrapping the system-specific I/O functions in os. We use this package for streaming reads and writes, such as reading and writing files.

6. net/http: provides HTTP client and server implementations for web services

7. strings: a collection of functions for working with strings, including joining, searching, splitting, comparing, checking suffixes, indexing, case handling, and so on.
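To make a few of these packages concrete, here is a minimal sketch that combines sync, strings, and fmt. The function name countWords and the sample lines are made up for illustration; the point is only to show a mutex and a wait group guarding a value shared between goroutines.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// countWords counts the words in all lines, one goroutine per line.
// A sync.Mutex guards the shared total and a sync.WaitGroup waits
// until every goroutine has finished.
func countWords(lines []string) int {
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		total int
	)
	for _, line := range lines {
		wg.Add(1)
		go func(s string) {
			defer wg.Done()
			n := len(strings.Fields(s)) // split on whitespace
			mu.Lock()
			total += n
			mu.Unlock()
		}(line)
	}
	wg.Wait()
	return total
}

func main() {
	lines := []string{"go is fun", "hello juejin", "crawl all the columns"}
	fmt.Println("words:", countWords(lines))
}
```

Without the mutex, the concurrent `total += n` would be a data race, which is exactly the situation the sync package is meant for.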

Recommended VS Code extension

  • Go: Rich Go Language Support for Visual Studio Code

Hello World

Create helloworld.go with the following content:

package main // Package declaration statement.

import "fmt" // Standard library package used for output

func main() {
	// Function call statement that prints the message.
	fmt.Println(sayHello("Juejin"))
}

func sayHello(juejin string) string {
	return "Hello " + juejin
}

Then run go run helloworld.go. OK, you have now gotten started with Go, and we can move on to the crawler. Next I will walk you step by step through building a crawler, named juejin-spider, that crawls a Juejin column article, converts it to Markdown, and saves it locally.

What is a crawler

Baidu Encyclopedia and Wikipedia both define web crawlers. Simply put, a crawler is a tool that fetches content from a target website. It generally crawls, analyzes, and filters web pages or data automatically according to predefined rules, for example by following the URLs found in each page.

In other words, it downloads the target page and then, through a series of operations such as parsing, filtering, and deduplication, obtains the data it wants and saves it in the appropriate format. The general process is as follows:
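The download, parse, filter, and deduplicate steps can be sketched with the standard library alone. This is only an illustration under stated assumptions: the page is served by a local httptest server, links are pulled out with a simple regular expression, and the names fetch, extractLinks, and crawlDemo are hypothetical. A real crawler would use a proper HTML parser.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"net/http/httptest"
	"regexp"
)

// hrefRe is a deliberately naive pattern for href attributes.
var hrefRe = regexp.MustCompile(`href="([^"]+)"`)

// extractLinks parses the page and returns each link once,
// in order of first appearance (the "deduplicate" step).
func extractLinks(page string) []string {
	seen := make(map[string]bool)
	var links []string
	for _, m := range hrefRe.FindAllStringSubmatch(page, -1) {
		if !seen[m[1]] {
			seen[m[1]] = true
			links = append(links, m[1])
		}
	}
	return links
}

// fetch downloads a page and returns its body as a string.
func fetch(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// crawlDemo serves a tiny page locally, downloads it, and extracts its links.
func crawlDemo() ([]string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `<a href="/post/1">one</a><a href="/post/2">two</a><a href="/post/1">again</a>`)
	}))
	defer srv.Close()

	page, err := fetch(srv.URL)
	if err != nil {
		return nil, err
	}
	return extractLinks(page), nil
}

func main() {
	links, err := crawlDemo()
	if err != nil {
		fmt.Println("crawl failed:", err)
		return
	}
	for _, l := range links {
		fmt.Println("found", l)
	}
}
```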

Introducing Colly

Colly (gocolly) is a web crawler framework written in Go. It currently has 11K+ stars on GitHub, ranking first among Go crawlers. Colly is fast and elegant, capable of more than 1K requests per second on a single core. It exposes a set of callback-based interfaces with which you can implement any kind of crawler. Because it is built on the goquery library, you can select page elements just as you would with jQuery.

Colly's official website is go-colly.org/, which has detailed documentation and sample code. To install Colly:

$ go get -u github.com/gocolly/colly/...

My first crawler

Manage the dependencies in go.mod:

module juejin.im/junning

go 1.14

require (
  github.com/gocolly/colly/v2 latest
)

Create a new main.go file and write the code.

It is not a long piece of code, but I had to read the official documentation and five or six blog posts to get it done, just to make my first crawler reasonably complete.

package main

import (
	"fmt"
	// 1, import colly.
	"github.com/gocolly/colly"
	"github.com/gocolly/colly/extensions"
)

func main() {
	// 2, create the collector
	c := colly.NewCollector(colly.AllowedDomains("juejin.im")) // Restrict the domain, otherwise it would crawl the whole web
	extensions.RandomUserAgent(c)                              // Use a random User-Agent (ideally with a proxy as well) so it is harder to get banned
	extensions.Referer(c)                                      // Set the Referer on each visit, i.e. the page the link was clicked from

	// 3, event listeners: callbacks that handle each event.
	c.OnError(func(_ *colly.Response, err error) {
		fmt.Println("Something went wrong:", err)
	})
	c.OnRequest(func(r *colly.Request) {
		// fmt.Println("Visiting", r.URL)
	})
	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Println(e.Text)
		e.Request.Visit(e.Attr("href"))
	})

	// 4, start crawling
	c.Visit("https://juejin.im/")
}
  • := declares a variable and assigns it a value in one step. It takes some getting used to after writing a lot of JS (for learning the syntax I recommend "Comics Go language pure hand-painted version")
  • *colly.HTMLElement is the declared type of the callback's parameter
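For readers coming from JS, the two pieces of syntax above can be seen side by side in a small sketch. The element type and the shout and describe functions are hypothetical stand-ins, not part of Colly.

```go
package main

import "fmt"

// element is a hypothetical stand-in for a type such as colly.HTMLElement.
type element struct {
	Text string
}

// shout takes a pointer (*element), so it modifies the caller's value
// instead of a copy.
func shout(e *element) {
	e.Text = e.Text + "!"
}

// describe shows both declaration forms and a pointer argument.
func describe() string {
	var a string = "declared with var" // long form with an explicit type
	b := "declared with :="            // short form, type inferred as string
	e := &element{Text: "hello"}       // e has type *element
	shout(e)
	return a + " | " + b + " | " + e.Text
}

func main() {
	fmt.Println(describe())
}
```

Note that := only works inside functions; package-level variables need the var form.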

Callbacks and the order in which they are called

Colly works by hooking into the request lifecycle: it exposes 7 events, each with a callback that developers can register.

  1. OnRequest: called before a request is made
  2. OnError: called if an error occurs during the request
  3. OnResponseHeaders: called after the response headers are received
  4. OnResponse: called after the response is received
  5. OnHTML: called right after OnResponse if the received content is HTML
  6. OnXML: called right after OnHTML if the received content is HTML or XML
  7. OnScraped: called after the OnXML callbacks
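The key idea in the list above is that the call order is fixed by the framework, not by the order in which you register callbacks. The sketch below illustrates that idea only. miniCollector is a hypothetical, heavily simplified type written for this example; it performs no network access and is not how Colly is implemented.

```go
package main

import "fmt"

// miniCollector is a hypothetical stand-in that stores callbacks per event.
type miniCollector struct {
	onRequest  []func(url string)
	onResponse []func(url, body string)
	onScraped  []func(url string)
}

func (c *miniCollector) OnRequest(f func(url string)) {
	c.onRequest = append(c.onRequest, f)
}

func (c *miniCollector) OnResponse(f func(url, body string)) {
	c.onResponse = append(c.onResponse, f)
}

func (c *miniCollector) OnScraped(f func(url string)) {
	c.onScraped = append(c.onScraped, f)
}

// visit fakes one request and fires the events in a fixed order:
// request first, then response, then scraped.
func (c *miniCollector) visit(url, fakeBody string) {
	for _, f := range c.onRequest {
		f(url)
	}
	for _, f := range c.onResponse {
		f(url, fakeBody)
	}
	for _, f := range c.onScraped {
		f(url)
	}
}

// callOrder registers callbacks in a scrambled order and records
// the order in which they actually fire.
func callOrder() []string {
	var order []string
	c := &miniCollector{}
	c.OnScraped(func(string) { order = append(order, "OnScraped") })
	c.OnResponse(func(string, string) { order = append(order, "OnResponse") })
	c.OnRequest(func(string) { order = append(order, "OnRequest") })
	c.visit("https://example.com/", "<html></html>")
	return order
}

func main() {
	fmt.Println(callOrder())
}
```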

Type definitions

The type definitions of colly.HTMLElement and colly.Request are listed here. During development you can jump to the type definition at any time, or look up the corresponding file under github.com/gocolly/col… (the source code is said to be very well written, so study it when you have time). PS: Go has a lot in common with TypeScript, which is why I was able to get started overnight.

*colly.HTMLElement

Online link: github.com/gocolly/col…

// HTMLElement is the representation of a HTML tag.
type HTMLElement struct {
	// Name is the name of the tag
	Name       string
	Text       string
	attributes []html.Attribute
	// Request is the request object of the element's HTML document
	Request *Request
	// Response is the Response object of the element's HTML document
	Response *Response
	// DOM is the goquery parsed DOM object of the page. DOM is relative
	// to the current HTMLElement
	DOM *goquery.Selection
	// Index stores the position of the current element within all the elements matched by an OnHTML callback
	Index int
}

*colly.Request

Online link: github.com/gocolly/col…

// Request is the representation of a HTTP request made by a Collector
type Request struct {
	// URL is the parsed URL of the HTTP request
	URL *url.URL
	// Headers contains the Request's HTTP headers
	Headers *http.Header
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Depth is the number of the parents of the request
	Depth int
	// Method is the HTTP method of the request
	Method string
	// Body is the request body which is used on POST/PUT requests
	Body io.Reader
	// ResponseCharacterEncoding is the character encoding of the response body.
	// Leave it blank to allow automatic character encoding of the response body.
	// It is empty by default and it can be set in OnRequest callback.
	ResponseCharacterEncoding string
	// ID is the Unique identifier of the request
	ID        uint32
	collector *Collector
	abort     bool
	baseURL   *url.URL
	// ProxyURL is the proxy address that handles the request
	ProxyURL string
}

Crawling a Juejin column

A crawler works by simulating a visit to a web page, fetching the document, parsing it by whatever means are suitable, and saving the data it needs.

Because I picked up Go in a hurry overnight, I cannot yet build a full data-crawling crawler, so I implemented the following flow:

Visit the Juejin column details page
⏬
Get the title and the specified content
⏬
Use the title as the file name
⏬
Convert the content into Markdown format
⏬
Save the file locally

Analyze the page structure

Column title structure

<h1 data-v-23a9d5ed="" class="article-title">(.*?)<\/h1>

Column Body Structure

<div class="markdown-body">(.*?)<\/div>
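To see what these two patterns capture, here is a small standard-library sketch that applies similar regular expressions to a hard-coded sample page. The sample HTML and the helper name firstMatch are made up for illustration, and the title pattern is loosened so that it does not depend on the exact data-v attribute.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// (?s) lets "." match newlines; (.*?) is a non-greedy capture group.
	titleRe = regexp.MustCompile(`(?s)<h1[^>]*class="article-title"[^>]*>(.*?)</h1>`)
	bodyRe  = regexp.MustCompile(`(?s)<div class="markdown-body">(.*?)</div>`)
)

// firstMatch returns the first capture group of re in page, or "" if none.
func firstMatch(re *regexp.Regexp, page string) string {
	m := re.FindStringSubmatch(page)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	page := `<h1 data-v-23a9d5ed="" class="article-title">Hello Go</h1>` +
		`<div class="markdown-body"><p>First paragraph</p></div>`
	fmt.Println("title:", firstMatch(titleRe, page))
	fmt.Println("body: ", firstMatch(bodyRe, page))
}
```

The body pattern stops at the first closing div, so it would cut off content containing nested div elements. That limitation is one reason selecting by CSS selector is more robust than matching raw HTML with regular expressions.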

Get the column title and content

func main() {
	c := colly.NewCollector(
		colly.Async(true),
	)

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.OnError(func(_ *colly.Response, err error) {
		log.Println("Something went wrong:", err)
	})

	c.OnHTML(".article-title", func(e *colly.HTMLElement) {
		// Code Here
	})

	c.OnHTML(".markdown-body", func(e *colly.HTMLElement) {
		// Code Here
	})

	// post is the article ID (a *string) taken from the command line
	c.Visit("https://juejin.im/post/" + *post)
	c.Wait()
}
  • colly.NewCollector is given an extra option, colly.Async(true), which enables asynchronous fetching and can significantly speed up crawling; c.Wait() then blocks until all requests have finished
  • Two OnHTML callbacks grab .article-title and .markdown-body separately. This is where the logic for the features we implement next will go.
  • The first parameter of OnHTML follows CSS selector rules, so you can use any selector you like.

Convert the HTML to Markdown

Here we take advantage of the functionality provided by the html-to-markdown library and wrap it briefly:

// Convert HTML to Markdown
func convertHTMLToMarkdown(selection *goquery.Selection) string {
	converter := md.NewConverter("", true, nil)
	markdown := converter.Convert(selection)
	return markdown
}

Save the file to the local directory

...
// Write to the file
func writeFile(fileName string, content string) {
	filePath := fileName + ".md"

	if checkFileIsExist(filePath) {
		// If the file exists, delete it
		err := os.Remove(filePath)
		if err != nil {
			log.Fatal(err)
		}
	}

	// Create the file and write the content
	file, err := os.Create(filePath)
	if err != nil {
		log.Fatal(err)
	}
	// Close the file when the function returns
	defer file.Close()
	if _, err := io.WriteString(file, "## "+fileName+"\n\n"+content); err != nil {
		log.Fatal(err)
	}
}

// Check whether the file exists
func checkFileIsExist(fileName string) bool {
	_, err := os.Stat(fileName)
	if err != nil {
		return false
	}
	return true
}
...
  • os.Stat: gets information about a file; checkFileIsExist wraps it to check whether the file exists
  • os.Create + io.WriteString: create the file and write to it
  • If the file exists, os.Remove(filePath) deletes it first so that the file is effectively overwritten (I did not bother to look up how to overwrite a file directly)
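On the overwrite question: os.Create truncates a file that already exists, so the explicit os.Remove step is not strictly required. The sketch below demonstrates this in a temporary directory. The names saveMarkdown and roundTrip and the sample content are hypothetical.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
)

// saveMarkdown writes content to <dir>/<title>.md and returns the path.
// os.Create truncates an existing file, so an old copy is overwritten.
func saveMarkdown(dir, title, content string) (string, error) {
	path := filepath.Join(dir, title+".md")
	file, err := os.Create(path)
	if err != nil {
		return "", err
	}
	defer file.Close()
	if _, err := file.WriteString("## " + title + "\n\n" + content); err != nil {
		return "", err
	}
	return path, nil
}

// roundTrip saves the same title twice and returns what is on disk afterwards.
func roundTrip() (string, error) {
	dir, err := ioutil.TempDir("", "juejin-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	if _, err := saveMarkdown(dir, "demo", "old content that is quite long"); err != nil {
		return "", err
	}
	path, err := saveMarkdown(dir, "demo", "new")
	if err != nil {
		return "", err
	}
	data, err := ioutil.ReadFile(path)
	return string(data), err
}

func main() {
	got, err := roundTrip()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", got)
}
```

Returning errors instead of calling log.Fatal is a design choice for this sketch; it lets the caller decide whether a failed save should stop the whole crawl.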

Get command line arguments

I have written about building CLI tools with Node.js in much the same way. Take a look if you are interested (no promises about next time, so do it this time).

func main() {
	var post = flag.String("post", "6859538537830858759", "Article Number")
	var rootDir = flag.String("root", root, "Root directory where files are saved")
	flag.Parse()
}
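As a self-contained variant of the same idea, the sketch below parses the two flags from an explicit argument slice using flag.NewFlagSet, which makes the parsing easy to try out without touching the real command line. The options struct, the parseOptions function, and the /tmp/juejin default are assumptions made for this example.

```go
package main

import (
	"flag"
	"fmt"
)

// options holds the parsed command-line values.
type options struct {
	Post string
	Root string
}

// parseOptions parses args (without the program name) into options.
// flag.String returns a pointer, which is why the values are read with *.
func parseOptions(args []string, defaultRoot string) (options, error) {
	fs := flag.NewFlagSet("juejin", flag.ContinueOnError)
	post := fs.String("post", "6859538537830858759", "Article Number")
	root := fs.String("root", defaultRoot, "Root directory where files are saved")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	return options{Post: *post, Root: *root}, nil
}

func main() {
	opts, err := parseOptions([]string{"-post", "123"}, "/tmp/juejin")
	if err != nil {
		fmt.Println("bad arguments:", err)
		return
	}
	fmt.Println("post:", opts.Post, "root:", opts.Root)
}
```

In a real tool you would pass os.Args[1:] instead of the hard-coded slice.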

Getting environment variables

Go does not expand ~ to the home directory by itself (that expansion is done by the shell). After some trouble, I found the solution below. It is much like Deno; language designs really do borrow from each other.

os.Getenv("HOME")
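Building on that, a path that starts with ~ can be expanded by hand. This is a minimal sketch: expandHome is a hypothetical helper, and it uses os.UserHomeDir (available since Go 1.12) as a portable alternative to reading HOME directly.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// expandHome replaces a leading "~" in path with the given home directory.
// Paths that do not start with "~" or "~/" are returned unchanged.
func expandHome(path, home string) string {
	if path == "~" {
		return home
	}
	if strings.HasPrefix(path, "~/") {
		return filepath.Join(home, path[2:])
	}
	return path
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Println("cannot determine home directory:", err)
		return
	}
	fmt.Println(expandHome("~/juejin", home))
}
```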

Getting the images

While testing the script I found that for articles containing images, the images were lost. That will not do: an article without images has no soul. Analysis showed that Juejin lazy-loads its images, and the tag looks something like this:

<img class="lazyload inited loaded" data-src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png" data-width="800" data-height="600" src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png">

I guessed that the data- attributes were the problem, so I added the following code to the script to strip the data- prefix:

...
		reg := regexp.MustCompile(`data-`)
		html, _ := e.DOM.Html()
		// Note: the argument here is an HTML string, not a *goquery.Selection
		markdown := convertHTMLToMarkdown(reg.ReplaceAllString(html, ""))
...
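The effect of that replacement can be checked in isolation with the standard library. In this sketch the function name stripDataPrefix and the sample tag are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// dataPrefix matches the literal text "data-" anywhere in the input.
var dataPrefix = regexp.MustCompile(`data-`)

// stripDataPrefix removes every "data-" occurrence, so data-src becomes src.
func stripDataPrefix(html string) string {
	return dataPrefix.ReplaceAllString(html, "")
}

func main() {
	in := `<img class="lazyload" data-src="https://example.com/a.png" data-width="800">`
	fmt.Println(stripDataPrefix(in))
	// prints: <img class="lazyload" src="https://example.com/a.png" width="800">
}
```

Because the pattern is not limited to attribute names, it also removes "data-" from ordinary article text or code samples that happen to contain it. Tightening the pattern to attributes only would be a reasonable refinement.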

youngjuning/[email protected] has now been released, and it grabs Juejin columns perfectly.

The finished product

The code is too long to include here; the source is at github.com/youngjuning… Since you have read this far, please give it a star.

Package and publish scripts using Homebrew

A hacker should have standards, so this cannot stay a toy. And since a compiled Go program runs without any dependencies or runtime environment, I should not ask people who use the tool to install Go. My first thought was to publish my script through Homebrew, and thankfully the Homebrew documentation explains the release process in great detail.

1. Package it as an executable

$ go build juejin.go

This generates an executable named juejin in the current directory, which you can run with ./juejin. To place it on a registered system path instead, use go build -o=/usr/local/bin juejin.go or go build -o=$GOPATH/bin/ juejin.go.

2. Package the executable file in the tar.gz format

$ tar zcvf juejin_0.0.1.tar.gz juejin

Upload the archive to a Git repository so that the formula can link to it.

3. Use brew create <git-url> --tap user/repo to create a formula

$ brew create \
https://github.com/youngjuning/homebrew-juejin-spider/raw/master/juejin_0.0.1.tar.gz \
--tap youngjuning/homebrew-juejin-spider

We need to adjust the install method of the generated formula:

def install
    bin.install "juejin"
end

When you’re done, save and commit to Git.

4. Install the script

$ brew install youngjuning/juejin-spider/juejin

Run the juejin -h command to check whether it succeeded:

$ juejin -h
Usage of juejin:
  -post string
    	Article Number (default "6859538537830858759")
  -root string
    	Root directory where files are saved (default "/Users/yangjunning/juejin")

Install your own scripts on someone else’s device

# When this command runs, brew automatically updates its formula repositories, which can take a few minutes...
$ brew tap youngjuning/juejin-spider https://github.com/youngjuning/homebrew-juejin-spider.git
# Download and install the script
$ brew install youngjuning/juejin-spider/juejin

Super Saiyan town

Thank you for your patience in reading this article. A like means you have learned it, a bookmark means you have mastered it, and both together mean true love!! I also look forward to discussing with you in the comments section!!

🏆 Technical Topic, Phase 2 | Me and Go, those things…