This article is an original article. Please scan the code to follow the public account flysnow_org or www.flysnow.org/, and read the following wonderful articles for the first time. If you feel good, feel free to share it in moments, thanks for your support.
When writing a crawler, we need to process the crawled content and extract the parts we care about, such as city or personnel information. Besides plain string searching, matching with regular expressions is a more elegant and convenient approach.
This article uses the date and article name in a URL as an example of how to extract strings with regular expressions.
Take the URL http://www.flysnow.org/2018/01/20/golang-goquery-examples-selector.html, which belongs to my first tutorial article on using goquery. The URL contains the date (year, month, and day) followed by the article name. How do we extract this information from such a URL? This is where regular expression grouping comes in.
Groups in a regular expression are written with parentheses (). Each pair of parentheses marks a piece of matched text that can be extracted.
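As a minimal sketch of how grouping behaves in Go, the snippet below applies a made-up pattern to a made-up input using the standard regexp package. Neither is taken from the URL example in this article, and the names yearMonth and splitYearMonth are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// yearMonth is an illustrative pattern with two groups:
// one or more digits, a hyphen, then one or more digits.
var yearMonth = regexp.MustCompile(`(\d+)-(\d+)`)

// splitYearMonth is a hypothetical helper. It returns the whole match
// followed by the text captured by each pair of parentheses.
func splitYearMonth(s string) []string {
	return yearMonth.FindStringSubmatch(s)
}

func main() {
	// NumSubexp reports how many groups the pattern contains.
	fmt.Println(yearMonth.NumSubexp()) // 2

	// Element 0 is the whole match; elements 1 and 2 are the groups.
	fmt.Printf("%q\n", splitYearMonth("2018-01")) // ["2018-01" "2018" "01"]
}
```

Each additional pair of parentheses adds one more element to the returned slice, in the order the opening parentheses appear.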
Based on the URL analysis above, we define the regular expression as follows:
^http://www.flysnow.org/([\d]{4})/([\d]{2})/([\d]{2})/([\w-]+).html$
^ and $ mark the start and end of the input, so the pattern must match the entire string.
[\d]{4} matches exactly four digits (the brackets are optional here; \d{4} is equivalent). The year has four digits, so we match four; the month and day each have two digits, so we match two.
[\w-] matches a word character (a letter, digit, or underscore) or a hyphen, and the plus sign (+) means one or more of them.
Each of these parts is wrapped in parentheses (), which means we want to extract the text it matches.
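One detail the pattern leaves loose: an unescaped . matches any character, so www.flysnow.org and .html also accept other characters in those positions. The sketch below compares the article's pattern with a stricter variant that escapes the dots. The variable names and the malformed URL are made up for illustration, and the stricter pattern is a suggestion rather than part of the original example.

```go
package main

import (
	"fmt"
	"regexp"
)

// loose is the pattern from the article; its dots match any character.
var loose = regexp.MustCompile(`^http://www.flysnow.org/([\d]{4})/([\d]{2})/([\d]{2})/([\w-]+).html$`)

// strict escapes each dot so that only a literal "." is accepted.
var strict = regexp.MustCompile(`^http://www\.flysnow\.org/(\d{4})/(\d{2})/(\d{2})/([\w-]+)\.html$`)

func main() {
	good := "http://www.flysnow.org/2018/01/20/golang-goquery-examples-selector.html"
	// bad replaces every dot with an X (hypothetical malformed input).
	bad := "http://wwwXflysnowXorg/2018/01/20/golang-goquery-examples-selectorXhtml"

	fmt.Println(loose.MatchString(good), strict.MatchString(good)) // true true
	fmt.Println(loose.MatchString(bad), strict.MatchString(bad))   // true false
}
```

For the well-formed URL both patterns capture the same four groups, so the rest of the walkthrough applies to either one.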
Take a look at the complete source code.
// Requires the fmt and regexp packages from the standard library.
flysnowRegexp := regexp.MustCompile(`^http://www.flysnow.org/([\d]{4})/([\d]{2})/([\d]{2})/([\w-]+).html$`)
params := flysnowRegexp.FindStringSubmatch("http://www.flysnow.org/2018/01/20/golang-goquery-examples-selector.html")
// Print the whole match first, then each captured group.
for _, param := range params {
	fmt.Println(param)
}
Running it prints:
http://www.flysnow.org/2018/01/20/golang-goquery-examples-selector.html
2018
01
20
golang-goquery-examples-selector
The FindStringSubmatch method returns the matched text as a []string. The first element is the entire matched string, and the elements after it are the groups we want, in order.
fmt.Println("Year: " + params[1])
fmt.Println("Month: " + params[2])
fmt.Println("Day: " + params[3])
fmt.Println("Article title: " + params[4])
This gives us all the information we need from the URL.
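The snippets above index params directly, which assumes the URL matched. FindStringSubmatch returns nil when there is no match, so indexing would panic on other input. Here is a self-contained sketch that wraps the extraction in a function and checks for that case; the Article type and the parseArticleURL helper are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// articleURL is the pattern from the article, compiled once at package level.
var articleURL = regexp.MustCompile(`^http://www.flysnow.org/([\d]{4})/([\d]{2})/([\d]{2})/([\w-]+).html$`)

// Article is a hypothetical container for the extracted fields.
type Article struct {
	Year, Month, Day, Name string
}

// parseArticleURL is a hypothetical helper. It reports false when the
// URL does not match, instead of indexing into a nil slice.
func parseArticleURL(url string) (Article, bool) {
	params := articleURL.FindStringSubmatch(url)
	if params == nil {
		return Article{}, false
	}
	return Article{
		Year:  params[1],
		Month: params[2],
		Day:   params[3],
		Name:  params[4],
	}, true
}

func main() {
	a, ok := parseArticleURL("http://www.flysnow.org/2018/01/20/golang-goquery-examples-selector.html")
	fmt.Println(a.Year, a.Month, a.Day, a.Name, ok)

	// A URL without the date and article name does not match.
	_, ok = parseArticleURL("http://www.flysnow.org/about")
	fmt.Println(ok) // false
}
```

Compiling the pattern once at package level avoids recompiling it on every call, which matters when a crawler processes many URLs.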
Regular expressions are well suited to processing text. For more on using regular expressions in Go, see the official introduction to the syntax: Github.com/google/re2/…
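As one example of the additional syntax, Go's regexp package supports named groups written as (?P<name>...), which lets us look up a group by name instead of by position. The sketch below is a variation on the article's pattern, not part of the original example; the group names and the namedGroups helper are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// named is the article's pattern with a name attached to each group.
var named = regexp.MustCompile(`^http://www.flysnow.org/(?P<year>[\d]{4})/(?P<month>[\d]{2})/(?P<day>[\d]{2})/(?P<name>[\w-]+).html$`)

// namedGroups is a hypothetical helper that maps group names to the
// captured text. It returns nil when the URL does not match.
func namedGroups(url string) map[string]string {
	match := named.FindStringSubmatch(url)
	if match == nil {
		return nil
	}
	result := make(map[string]string)
	for i, name := range named.SubexpNames() {
		// Index 0 is the whole match and has an empty name; skip it.
		if name != "" {
			result[name] = match[i]
		}
	}
	return result
}

func main() {
	groups := namedGroups("http://www.flysnow.org/2018/01/20/golang-goquery-examples-selector.html")
	fmt.Println(groups["year"], groups["month"], groups["day"], groups["name"])
}
```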