This is an original article. Follow the public account flysnow_org or visit www.flysnow.org/ to read new articles as soon as they are published. If you find it useful, feel free to share it. Thanks for your support.
While studying Go web crawlers recently, I have been using the goquery library heavily, especially its selectors, to pick matching content out of fetched HTML. Several selectors are rarely used but very handy, so they are summarized here for reference.
If you have done front-end development, you will already know jQuery. goquery is essentially a Go version of jQuery, and it makes processing HTML easy.
Selector based on HTML element
Use the element name as the selector, for example dom.Find("div").
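package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    html := `<body>
<div>DIV1</div>
<div>DIV2</div>
<span>SPAN</span>
</body>`
    // Parse the HTML string into a goquery document.
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    // Print the text of every div element.
    dom.Find("div").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })
}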
In the example above, the div elements are selected, but body and span are not.
ID selector
This is the most frequently used selector. In the example above there are two div elements; if we need only one of them, we can give that tag a unique ID and then use the ID selector to locate it precisely.
// Imports needed: fmt, log, strings, github.com/PuerkitoBio/goquery.
func main() {
    html := `<body>
<div id="div1">DIV1</div>
<div>DIV2</div>
<span>SPAN</span>
</body>`
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    // Select the element whose id attribute is div1.
    dom.Find("#div1").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })
}
Element ID selector
The ID selector starts with #, followed by the value of the element's id attribute; the syntax is dom.Find("#id"). The examples below abbreviate this to Find("#id"), which always stands for a goquery selector.
What if the same ID appears on different HTML elements? Combine the ID selector with an element selector. For example, to select elements that are div and whose ID is div1, use Find("div#div1").
The syntax for this kind of selector is Find("element#id"). Combining selectors like this is common, and the filters described later can be combined the same way.
Class selector
class is also a common attribute in HTML. The class selector quickly selects HTML elements by class and is used much like the ID selector: Find(".class").
// Imports needed: fmt, log, strings, github.com/PuerkitoBio/goquery.
func main() {
    html := `<body>
<div id="div1">DIV1</div>
<div class="name">DIV2</div>
<span>SPAN</span>
</body>`
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    // Select every element whose class is name.
    dom.Find(".name").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })
}
In the example above, the div element whose class is name is selected.
Element Class selector
Like ID selectors, class selectors can be combined with HTML elements. The syntax is Find("element.class"), which selects elements of a given type that also have the given class.
Attribute selector
An HTML element has its own attributes and attribute values, so you can filter elements by attributes and values as well.
// Imports needed: fmt, log, strings, github.com/PuerkitoBio/goquery.
func main() {
    html := `<body>
<div>DIV1</div>
<div class="name">DIV2</div>
<span>SPAN</span>
</body>`
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    // Select div elements that have a class attribute, whatever its value.
    dom.Find("div[class]").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })
}
In this example, the div[class] selector matches div elements that have a class attribute, so the first div is not selected.
The previous example filters on whether an attribute exists. In the same way, we can select elements whose attribute has a specific value.
dom.Find("div[class=name]").Each(func(i int, selection *goquery.Selection) {
    fmt.Println(selection.Text())
})
This selects div elements whose class attribute is name.
The class attribute is only an example; any other attribute works too, such as href, as do custom attributes.
Besides exact equality, there are other matching operators, used in much the same way. All of them are listed here:
Selector | Description |
---|---|
Find("div[lang]") | Selects div elements that have a lang attribute |
Find("div[lang=zh]") | Selects div elements whose lang attribute is zh |
Find("div[lang!=zh]") | Selects div elements whose lang attribute is not equal to zh |
Find("div[lang\|=zh]") | Selects div elements whose lang attribute is zh or starts with zh- |
Find("div[lang*=zh]") | Selects div elements whose lang attribute contains the string zh |
Find("div[lang~=zh]") | Selects div elements whose lang attribute contains the word zh, with words separated by spaces |
Find("div[lang$=zh]") | Selects div elements whose lang attribute ends with zh, case sensitive |
Find("div[lang^=zh]") | Selects div elements whose lang attribute starts with zh, case sensitive |
Attribute selectors can be combined by chaining several bracketed conditions, for example Find("div[id][lang=zh]"). When there are multiple attribute conditions, only elements that satisfy all of them are selected.
parent>child selector
To select child elements of an element, use the child selector. Its syntax is Find("parent>child"), which selects the direct children of the parent element that match the child condition.
// Imports needed: fmt, log, strings, github.com/PuerkitoBio/goquery.
func main() {
    html := `<body>
<div lang="ZH">DIV1</div>
<div lang="zh-cn">DIV2</div>
<div lang="en">DIV3</div>
<span>
<div>DIV4</div>
</span>
</body>`
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    // Select only the div elements that are direct children of body.
    dom.Find("body>div").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })
}
In the example above, body is the parent element and its direct div children are selected, so the result is DIV1, DIV2, and DIV3. DIV4 is also a descendant of body, but it is not a direct child, so it is not selected.
So what if we want DIV4 as well, that is, every div under body regardless of whether it is at level 1, level 2, or level N? goquery supports this too: simply replace the greater-than sign (>) with a space. For the example above, the selector becomes:
dom.Find("body div").Each(func(i int, selection *goquery.Selection) {
    fmt.Println(selection.Text())
})
prev+next adjacent selector
If the elements we want have no regular pattern of their own, but the element immediately before them does, we can use the adjacent selector.
// Imports needed: fmt, log, strings, github.com/PuerkitoBio/goquery.
func main() {
    html := `<body>
<div lang="zh">DIV1</div>
<p>P1</p>
<div lang="zh-cn">DIV2</div>
<div lang="en">DIV3</div>
<span>
<div>DIV4</div>
</span>
<p>P2</p>
</body>`
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    // Select the p element that immediately follows div[lang=zh].
    dom.Find("div[lang=zh]+p").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })
}
Here Find("div[lang=zh]+p") selects the p element that immediately follows that div, which is P1.
The syntax of this selector is ("prev+next"): a plus sign (+) with a selector on each side.
prev~next sibling selector
Besides adjacent elements there are siblings. Sibling selectors do not require the elements to be adjacent; they only need to share a parent element, with next coming after prev.
dom.Find("div[lang=zh]~p").Each(func(i int, selection *goquery.Selection) {
    fmt.Println(selection.Text())
})
Taking the adjacent-selector example and replacing + with ~ also selects P2, because P2, P1, and DIV1 are siblings.
The syntax for sibling selectors is ("prev~next"): the + of the adjacent selector is replaced with ~.
Content filter
Sometimes, after selecting with a selector, we want to narrow the result further. That is what filters are for. There are many of them; let's start with content filters.
dom.Find("div:contains(DIV2)").Each(func(i int, selection *goquery.Selection) {
    fmt.Println(selection.Text())
})
Find(":contains(text)") selects elements that contain the specified text. In our example, div elements containing the text DIV2 are selected, and only the DIV2 element meets that condition.
In addition, Find(":empty") selects only elements that have no children at all (text nodes count as children).
Find(":has(selector)") is similar to contains, except that what must be contained is an element node.
dom.Find("span:has(div)").Each(func(i int, selection *goquery.Selection) {
    fmt.Println(selection.Text())
})
The example above selects span nodes that contain a div element.
:first-child filter
The :first-child filter, syntax Find(":first-child"), selects elements that are the first child of their parent element; elements that are not are left out.
// Imports needed: fmt, log, strings, github.com/PuerkitoBio/goquery.
func main() {
    html := `<body>
<div lang="zh">DIV1</div>
<p>P1</p>
<div lang="zh-cn">DIV2</div>
<div lang="en">DIV3</div>
<span>
<div style="display:none;" >DIV4</div>
<div>DIV5</div>
</span>
<p>P2</p>
<div></div>
</body>`
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    dom.Find("div:first-child").Each(func(i int, selection *goquery.Selection) {
        // Html returns (string, error); Println prints both values.
        fmt.Println(selection.Html())
    })
}
In the example above, Find("div") alone would select all div elements, but with :first-child added only DIV1 and DIV4 remain, because each is the first child of its parent element.
:first-of-type filter
:first-child requires the element to be the first child of its parent. If other elements come before it, :first-child does not match; :first-of-type can be used instead, because it only requires the element to be the first of its type.
// Imports needed: fmt, log, strings, github.com/PuerkitoBio/goquery.
func main() {
    html := `<body>
<div lang="zh">DIV1</div>
<p>P1</p>
<div lang="zh-cn">DIV2</div>
<div lang="en">DIV3</div>
<span>
<p>P2</p>
<div>DIV5</div>
</span>
<div></div>
</body>`
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    dom.Find("div:first-of-type").Each(func(i int, selection *goquery.Selection) {
        // Html returns (string, error); Println prints both values.
        fmt.Println(selection.Html())
    })
}
With :first-child, DIV5 cannot be selected, because it is not the first child: P2 precedes it. :first-of-type does the job, because it only requires the element to be the first of its type. DIV5 is the first div within its parent; P2 is not a div, so it is ignored.
:last-child and :last-of-type filters
These mirror :first-child and :first-of-type: instead of the first child and the first element of its type, they match the last child and the last element of its type.
:nth-child(n) filter
This selects elements that are the nth child of their parent, with n starting at 1. So :first-child is equivalent to :nth-child(1). Specifying n gives us the flexibility to select exactly the elements we need.
// Imports needed: fmt, log, strings, github.com/PuerkitoBio/goquery.
func main() {
    html := `<body>
<div lang="zh">DIV1</div>
<p>P1</p>
<div lang="zh-cn">DIV2</div>
<div lang="en">DIV3</div>
<span>
<p>P2</p>
<div>DIV5</div>
</span>
<div></div>
</body>`
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    dom.Find("div:nth-child(3)").Each(func(i int, selection *goquery.Selection) {
        // Html returns (string, error); Println prints both values.
        fmt.Println(selection.Html())
    })
}
This example selects DIV2, because DIV2 is the third child of its parent element body.
:nth-of-type(n) filter
:nth-of-type(n) is similar to :nth-child(n), except that it selects the nth element of the same type, so :nth-of-type(1) is equivalent to :first-of-type.
:nth-last-child(n) and :nth-last-of-type(n) filters
These are similar to the filters above, except that they count in reverse order, treating the last element as the first. Try them out yourself; the behavior is easy to see.
:only-child filter
As you might guess, the Find(":only-child") filter selects an element only if it is the sole child of its parent, that is, its parent has no other children.
// Imports needed: fmt, log, strings, github.com/PuerkitoBio/goquery.
func main() {
    html := `<body>
<div lang="zh">DIV1</div>
<span>
<div>DIV5</div>
</span>
</body>`
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    dom.Find("div:only-child").Each(func(i int, selection *goquery.Selection) {
        // Html returns (string, error); Println prints both values.
        fmt.Println(selection.Html())
    })
}
In this example, DIV5 is selected because it is the only child of its parent span. DIV1 is not the only child of its parent, so it is not selected.
:only-of-type filter
In the example above, what if you want to select DIV1? Use Find(":only-of-type"), because DIV1 is the only div element within its parent. That is what the :only-of-type filter does: an element is selected if its parent contains no other element of the same type. Change the example above to :only-of-type and check that DIV1 appears.
Selector OR (|) operation
What if we want to select both div and span elements? Several selectors can be combined, separated by commas (,): Find("selector1, selector2, selectorN"). An element is selected if it matches any one of the selectors, which amounts to an OR (|) operation on selectors.
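// Imports needed: fmt, log, strings, github.com/PuerkitoBio/goquery.
func main() {
    html := `<body>
<div lang="zh">DIV1</div>
<span>
<div>DIV5</div>
</span>
</body>`
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    // Select every element that is either a div or a span.
    dom.Find("div,span").Each(func(i int, selection *goquery.Selection) {
        // Html returns (string, error); Println prints both values.
        fmt.Println(selection.Html())
    })
}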
Summary
goquery is an essential tool for parsing HTML pages. When a crawler fetches web pages, flexible use of goquery's different selectors gets twice the result with half the effort and greatly improves crawler efficiency.