This is the 23rd day of my participation in the Gwen Challenge

Share a wave of GO crawlers

Let’s go back to the last time we talked about using GOLANG to send emails


  • What email is
  • What the common mail protocols are
  • How to send email with GOLANG
  • How to send plain text, HTML content, attachments, and so on
  • How to CC or BCC an email
  • How to improve the performance of sending email

If you want to see how to use GOLANG to send email, check out the article How to use GOLANG to send email

Remember that earlier we also briefly shared an article on Golang + Chromedp + GoQuery crawling dynamic data | Go theme month

If you’re interested, we can take a closer look at how the Chromedp framework works

Today let’s look at using GO to crawl static web page data

What are static and dynamic web pages

What is static web data?

  • A static web page contains no program code, only HTML (hypertext markup language); the file suffix is usually .html, .htm, .xml, and so on
  • Another feature of a static web page is that anyone can open it directly, and the content is the same no matter who opens it or when; the HTML code is fixed, so the displayed result is fixed

By the way, what is a dynamic web page

  • Dynamic web pages are a web programming technique

    In addition to HTML markup, a dynamic web page file also contains program code with specific functions

    This code lets the browser and the server interact, so the server can dynamically generate page content according to the client’s different requests, which is very flexible

In other words, even though the code of a dynamic page does not change, the content it displays can change over time, across environments, and as the database changes
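To make the difference concrete, here is a minimal, hypothetical sketch (not part of the crawler below): a tiny GO server with one static route, whose bytes never change, and one dynamic route, whose content is generated per request. The ./public directory and the :8080 port are just assumptions for the illustration.

package main

import (
   "fmt"
   "log"
   "net/http"
   "time"
)

func main() {
   // Static: the files under ./public never change, so every visitor
   // sees exactly the same content at any time
   http.Handle("/static/", http.StripPrefix("/static/", http.FileServer(http.Dir("./public"))))

   // Dynamic: the content is generated on every request, so it changes
   // over time even though the handler code itself is fixed
   http.HandleFunc("/dynamic", func(w http.ResponseWriter, r *http.Request) {
      fmt.Fprintf(w, "<html><body>server time: %s</body></html>", time.Now().Format(time.RFC3339))
   })

   log.Fatal(http.ListenAndServe(":8080", nil))
}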

Using GO to crawl static data from a web page

Let’s crawl a static web page. For example, let’s crawl the static data on this site: the account and password information on the page.

http://www.ucbug.com/jiaocheng/63149.html?_t=1582307696

The steps for crawling this site are:

  • Specify the specific site to crawl
  • Fetch the data with an HTTP GET request
  • Convert the byte slice to a string
  • Use a regular expression to match the content we want (this is important: when crawling static pages, processing and filtering the data takes most of the time)
  • Screen the data, de-duplicate it, and so on (this step varies from person to person according to individual needs and the target website)

Let’s write a DEMO to crawl the account and password information from the URL above. We use it only for learning and should not use it to do anything bad.

package main

import (
   "io/ioutil"
   "log"
   "net/http"
   "regexp"
)

const (
   // Regular expression to match the XL account and password lines
   // (the keywords are the Chinese words for "account"/"Thunder" and "password" as they appear on the page)
   reAccount = `(账号|迅雷)(;|:)[0-9:]+( |)密码:[0-9a-zA-Z]+`
)

// Get the website account and password
func GetAccountAndPwd(url string) {
   // Get the website data
   resp, err := http.Get(url)
   if err != nil {
      log.Fatal("http.Get error : ", err)
   }
   defer resp.Body.Close()

   // Read the response body as bytes
   dataBytes, err := ioutil.ReadAll(resp.Body)
   if err != nil {
      log.Fatal("ioutil.ReadAll error : ", err)
   }

   // Convert the byte slice to a string
   str := string(dataBytes)

   // Filter the XL accounts and passwords
   re := regexp.MustCompile(reAccount)

   // How many matches to return; -1 means all
   results := re.FindAllStringSubmatch(str, -1)

   // Output the results
   for _, result := range results {
      log.Println(result[0])
   }
}

func main() {
   // Simply set the log flags
   log.SetFlags(log.Lshortfile | log.LstdFlags)
   // Pass in the website address and start crawling the data
   GetAccountAndPwd("http://www.ucbug.com/jiaocheng/63149.html?_t=1582307696")
}

The result of running the above code is as follows:

2021/06/xx xx:05:25 main.go:46: Account: 357451317 Password: 110120a
2021/06/xx xx:05:25 main.go:46: Account: 907812219 Password: 810303
2021/06/xx xx:05:25 main.go:46: Account: 797169897 Password: zxcvbnm132
2021/06/xx xx:05:25 main.go:46: Thunder account: 792253782:1 Password: 283999
2021/06/xx xx:05:25 main.go:46: Thunder account: 147643189:2 Password: 344867
2021/06/xx xx:05:25 main.go:46: Thunder account: 147643189:1 Password: 267297

As you can see, both the data starting with “Account” and the data starting with “Thunder account” were crawled. Crawling static web page content is really not hard; the time is mostly spent on regular-expression matching and data processing.

According to the steps above, we can list the following:

  • Visit the web site: http.Get(url)
  • Read the data content: ioutil.ReadAll
  • Convert the data to a string
  • Set the regular-expression matching rule: regexp.MustCompile(reAccount)
  • Start filtering the data; you can set how many matches to take: re.FindAllStringSubmatch(str, -1)

In practice, of course, it won’t be that simple.

For example, the data extracted from the website may not have a uniform format, may contain many special characters, and may be messy and irregular; the data may even be dynamic and impossible to fetch with a plain GET.

However, these problems can all be solved: design different solutions and data processing for different problems. I believe you will be able to work them out; when facing a problem, we should have the determination to solve it.
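As one small example, here is a hypothetical sketch of that kind of post-processing, assuming the matched lines may mix full-width and ASCII punctuation and carry stray whitespace; cleanLine and dedupe are made-up helper names, not part of the code above.

package main

import (
   "fmt"
   "strings"
)

// cleanLine normalizes full-width punctuation to ASCII and trims stray whitespace
func cleanLine(s string) string {
   replacer := strings.NewReplacer("：", ":", "；", ";", "　", " ")
   return strings.TrimSpace(replacer.Replace(s))
}

// dedupe keeps only the first occurrence of each cleaned line
func dedupe(lines []string) []string {
   seen := make(map[string]bool)
   var out []string
   for _, line := range lines {
      c := cleanLine(line)
      if !seen[c] {
         seen[c] = true
         out = append(out, c)
      }
   }
   return out
}

func main() {
   // Two raw matches that differ only in punctuation width and spacing
   raw := []string{"账号：123456 密码：abc123 ", "账号:123456 密码:abc123"}
   fmt.Println(dedupe(raw)) // after cleaning they are identical, so only one is kept
}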

Crawling pictures

Building on the example above, let’s try crawling the image data on a web page, for example by searching for shiba inu on a certain search engine (Baidu)

It’s a page like this

Let’s copy the URL from the address bar and use it as the address to crawl:

https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=%E6%9F%B4%E7%8A%AC

Since there are a lot of pictures, we limit the match to only two pictures for now

Let’s take a look at the DEMO

  • By the way, the logic that GETs the url data and converts it to a string has been extracted into a small helper function, getStr
  • GetPic lets you set how many matches to take; here we pass 2
package main

import (
   "io/ioutil"
   "log"
   "net/http"
   "regexp"
)

const (
   // Regular expression to match the XL account and password lines
   reAccount = `(账号|迅雷)(;|:)[0-9:]+( |)密码:[0-9a-zA-Z]+`
   // Regular expression to match image links
   rePic = `https?://[^"]+?(\.((jpg)|(png)|(jpeg)|(gif)|(bmp)))`
)

// Get the web page data and convert it to a string
func getStr(url string) string {
   resp, err := http.Get(url)
   if err != nil {
      log.Fatal("http.Get error : ", err)
   }
   defer resp.Body.Close()

   // Read the response body as bytes
   dataBytes, err := ioutil.ReadAll(resp.Body)
   if err != nil {
      log.Fatal("ioutil.ReadAll error : ", err)
   }

   // Convert the byte slice to a string
   return string(dataBytes)
}

// Get the website account and password
func GetAccountAndPwd(url string, n int) {
   str := getStr(url)
   // Filter the XL accounts and passwords
   re := regexp.MustCompile(reAccount)

   // How many matches to return; -1 means all
   results := re.FindAllStringSubmatch(str, n)

   // Output the results
   for _, result := range results {
      log.Println(result[0])
   }
}

// Get the image links
func GetPic(url string, n int) {
   str := getStr(url)

   // Filter the image links
   re := regexp.MustCompile(rePic)

   // How many matches to return; -1 means all
   results := re.FindAllStringSubmatch(str, n)

   // Output the results
   for _, result := range results {
      log.Println(result[0])
   }
}

func main() {
   // Simply set the log flags
   log.SetFlags(log.Lshortfile | log.LstdFlags)
   //GetAccountAndPwd("http://www.ucbug.com/jiaocheng/63149.html?_t=1582307696", -1)
   GetPic("https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=%E6%9F%B4%E7%8A%AC", 2)
}

The result of running the above code is as follows (without de-duplication):

2021/06/xx xx:06:39 main.go:63: https://ss1.bdstatic.com/70cFuXSh_Q1YnxGkpoWK1HF6hhy/it/u=4246005838,1103140037&fm=26&gp=0.jpg
2021/06/xx xx:06:39 main.go:63: https://ss1.bdstatic.com/70cFuXSh_Q1YnxGkpoWK1HF6hhy/it/u=4246005838,1103140037&fm=26&gp=0.jpg

Sure enough, this is what we wanted, but what we printed and crawled are only picture links, which cannot meet the needs of a real crawler; we still have to download the crawled pictures so we can actually use them.

For convenience of demonstration, let’s add a small file-download function to the code above and download just the first picture.

package main

import (
   "fmt"
   "io/ioutil"
   "log"
   "net/http"
   "regexp"
   "strings"
   "time"
)

const (
   // Regular expression to match image links
   rePic = `https?://[^"]+?(\.((jpg)|(png)|(jpeg)|(gif)|(bmp)))`
)

// Get the web page data and convert it to a string
func getStr(url string) string {
   resp, err := http.Get(url)
   if err != nil {
      log.Fatal("http.Get error : ", err)
   }
   defer resp.Body.Close()

   // Read the response body as bytes
   dataBytes, err := ioutil.ReadAll(resp.Body)
   if err != nil {
      log.Fatal("ioutil.ReadAll error : ", err)
   }

   // Convert the byte slice to a string
   return string(dataBytes)
}

// Get the image data
func GetPic(url string, n int) {
   str := getStr(url)

   // Filter the image links
   re := regexp.MustCompile(rePic)

   // How many matches to return; -1 means all
   results := re.FindAllStringSubmatch(str, n)

   // Process the results
   for _, result := range results {
      // Build the file name for the image
      fileName := GetFilename(result[0])
      // Download the image
      DownloadPic(result[0], fileName)
   }
}

// Build the file name
func GetFilename(url string) (filename string) {
   // Find the index of the last "="
   lastIndex := strings.LastIndex(url, "=")
   // The string after it is used as the source file name
   filename = url[lastIndex+1:]

   // Prefix the original name with a timestamp to create a new name
   prefix := fmt.Sprintf("%d", time.Now().Unix())
   filename = prefix + "_" + filename

   return filename
}

// Download the image to the given file name
func DownloadPic(url string, filename string) {
   resp, err := http.Get(url)
   if err != nil {
      log.Fatal("http.Get error : ", err)
   }
   defer resp.Body.Close()

   bytes, err := ioutil.ReadAll(resp.Body)
   if err != nil {
      log.Fatal("ioutil.ReadAll error : ", err)
   }

   // Path of the file: the current directory
   filename = "./" + filename

   // Write the file and set the file permissions
   err = ioutil.WriteFile(filename, bytes, 0666)
   if err != nil {
      log.Fatal("write failed !!", err)
   } else {
      log.Println("ioutil.WriteFile successfully , filename = ", filename)
   }
}

func main() {
   // Simply set the log flags
   log.SetFlags(log.Lshortfile | log.LstdFlags)
   GetPic("https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=%E6%9F%B4%E7%8A%AC", 1)
}

In the code above, we added two functions to help us name and download the picture:

  • GetFilename builds the file name, prefixing it with a timestamp
  • DownloadPic downloads the specific picture to the current directory

Running the code above, you can see the following effect

2021/06/xx xx:50:04 main.go:91: ioutil.WriteFile successfully , filename =  ./1624377003_0.jpg

An image named 1624377003_0.jpg has been downloaded to the current directory

Below is the downloaded picture itself

Some readers will say that downloading the pictures with a single coroutine is too slow. Can we use more coroutines to download the pictures together and crawl our shiba inu faster?

Remember the GO channel and sync package we talked about earlier (GO channel and sync package sharing)? This is a good chance to practice. The feature is relatively simple, so I will just outline the general idea; if you are interested, you can go and implement it yourself (a minimal sketch follows the list):

  • Read the page at the URL above and convert it to a string
  • Use a regex to match the series of image links
  • Put each image link into a buffered channel, with a buffer of 100 for now
  • Start 3 more coroutines to read from the channel concurrently and download the images locally; the file-naming logic can follow the code above
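For those who want to try it, here is a minimal sketch of that idea. It reuses the getStr, rePic, GetFilename and DownloadPic definitions from the listing above (its main function would be replaced by this one), and the buffer of 100 and the 3 worker coroutines are just the numbers from the list.

package main

import (
   "log"
   "regexp"
   "sync"
)

func main() {
   log.SetFlags(log.Lshortfile | log.LstdFlags)
   url := "https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=%E6%9F%B4%E7%8A%AC"

   // 1. Read the page and match all image links
   str := getStr(url)
   re := regexp.MustCompile(rePic)
   results := re.FindAllStringSubmatch(str, -1)

   // 2. Put every image link into a buffered channel (buffer of 100 for now)
   picChan := make(chan string, 100)
   go func() {
      for _, result := range results {
         picChan <- result[0]
      }
      close(picChan)
   }()

   // 3. Start 3 coroutines that read from the channel concurrently and download
   var wg sync.WaitGroup
   for i := 0; i < 3; i++ {
      wg.Add(1)
      go func() {
         defer wg.Done()
         for link := range picChan {
            // Note: GetFilename prefixes a per-second timestamp, so links whose
            // trailing names repeat within the same second could still collide
            DownloadPic(link, GetFilename(link))
         }
      }()
   }
   wg.Wait()
}

Because DownloadPic calls log.Fatal on error, any failed download stops the whole program; in a real crawler you would probably return the error and keep going instead.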

How about it, friends? If you are interested, practice it yourself. If you have ideas about crawling dynamic data, we can communicate and discuss them and make progress together.

Conclusion

  • A brief description of static and dynamic web pages
  • Crawling simple data from static web pages with GO
  • Crawling pictures on web pages with GO
  • Downloading crawled resources from web pages

Welcome to like, follow and favorite

Friends, your support and encouragement are what keep me sharing and improving the quality of my posts

OK, that’s it for this time; next time we’ll share an application of GO together with GJSON

Technology is open, and our mindset should be even more open. Embrace change, face the sun, and keep striving forward.

I am Nezha, welcome to like, see you next time ~