Using HtmlAgilityPack in .NET Core to parse and collect web data

HtmlAgilityPack is arguably the most convenient HTML parsing library available for .NET.

I was recently collecting some data for friends, so I looked at several libraries on NuGet and finally settled on HtmlAgilityPack. Here is a short write-up of how I use it.

Install it directly from NuGet:

Install-Package HtmlAgilityPack -Version 1.11.16

1. Download the web page

The library provides a class for downloading web pages: HtmlWeb

var webGet = new HtmlWeb(); 
var document = webGet.Load(url);

If the network is working, you’ll get an HtmlDocument object. All subsequent operations are based on this class.

I personally prefer to download the page with HttpClient and then pass the HTML to HtmlDocument.LoadHtml, because HttpClient gives you more control. For example, you can add a proxy IP address or set a random User-Agent.

For simple cases, though, HtmlWeb works just as well.
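The HttpClient approach described above could look roughly like the sketch below. It is only an illustration: the User-Agent strings, the helper names CreateClient and DownloadAsync, and the convention that an empty proxy string means "no proxy" are assumptions, not part of HtmlAgilityPack.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

// Top-level statements (C# 9 or later). Nothing is downloaded unless DownloadAsync is called.

// Hypothetical User-Agent pool, for illustration only.
string[] userAgents =
{
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
};

// Build a client with a given User-Agent and an optional proxy ("" means no proxy).
HttpClient CreateClient(string userAgent, string proxyUrl)
{
    var handler = new HttpClientHandler();
    if (!string.IsNullOrEmpty(proxyUrl))
    {
        handler.Proxy = new WebProxy(proxyUrl);
        handler.UseProxy = true;
    }
    var http = new HttpClient(handler);
    http.DefaultRequestHeaders.UserAgent.ParseAdd(userAgent);
    return http;
}

// Download the page as a string and hand it to HtmlDocument.LoadHtml.
async Task<HtmlDocument> DownloadAsync(HttpClient http, string url)
{
    string html = await http.GetStringAsync(url);
    var document = new HtmlDocument();
    document.LoadHtml(html);
    return document;
}

// Pick a User-Agent at random for this client.
string chosenAgent = userAgents[new Random().Next(userAgents.Length)];
HttpClient demoClient = CreateClient(chosenAgent, "");
Console.WriteLine(demoClient.DefaultRequestHeaders.UserAgent);

// Usage (performs a real request):
// HtmlDocument page = await DownloadAsync(demoClient, "https://example.com/");
```

Reusing one HttpClient for many requests is generally preferable to creating a new one per page.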

2. Parse the web page

The first step gave us an HtmlDocument object, which provides many operations.

For example, to get the author of an article on a web page, open the page in Chrome, right-click the author's name and choose Inspect, then right-click the highlighted node in the Elements panel and choose Copy -> Copy XPath.

document.DocumentNode.SelectSingleNode("XPath copied from Chrome")?.InnerText

And with that, we have the author's name.
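As a small self-contained illustration of the single-node lookup, the sketch below parses an in-memory HTML string instead of a downloaded page. The markup and the class name "author" are made up for the example.

```csharp
using System;
using HtmlAgilityPack;

// Sample markup; the structure and class names are hypothetical.
string html = @"<html><body>
  <div class='post'>
    <h1>Hello</h1>
    <span class='author'> Alice &amp; Bob </span>
  </div>
</body></html>";

var document = new HtmlDocument();
document.LoadHtml(html);

// SelectSingleNode returns null when nothing matches, hence the ?. operator.
var authorNode = document.DocumentNode.SelectSingleNode("//span[@class='author']");
string rawAuthor = authorNode?.InnerText ?? "";

// InnerText may still contain entities such as &amp;, so decode them and trim whitespace.
string author = HtmlEntity.DeEntitize(rawAuthor).Trim();
Console.WriteLine(author);

// A query that matches nothing yields null instead of throwing.
var missing = document.DocumentNode.SelectSingleNode("//span[@class='editor']");
Console.WriteLine(missing == null);
```

XPath copied from Chrome is usually an absolute path tied to the page's current layout, so a shorter attribute-based expression like the one above tends to survive small markup changes better.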

How do I parse a list?

In the case of blogs, the home page is a list of articles. How do we get all the items in this list?

var nodes = document.DocumentNode.SelectNodes("xpath expression");

If you are familiar with XPath, you know that a leading double slash matches nodes anywhere in the document, so an expression such as //div[@class='post_item'] returns every matching node. SelectNodes gives you a collection of HtmlNode objects (or null when nothing matches); from there you just parse the child nodes of each item, and you are done!
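A minimal sketch of list parsing follows. The sample markup, including the "title" class on the links, is invented for the example; only the post_item class comes from the text above.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical list markup standing in for a blog home page.
string html = @"<div id='list'>
  <div class='post_item'><a class='title' href='/p/1'>First post</a></div>
  <div class='post_item'><a class='title' href='/p/2'>Second post</a></div>
</div>";

var document = new HtmlDocument();
document.LoadHtml(html);

var titles = new List<string>();
var links = new List<string>();

var items = document.DocumentNode.SelectNodes("//div[@class='post_item']");
if (items != null) // SelectNodes returns null, not an empty collection, when nothing matches
{
    foreach (var item in items)
    {
        // ".//" searches only below the current item; a bare "//" would search the whole document again.
        var link = item.SelectSingleNode(".//a[@class='title']");
        if (link == null) continue;
        titles.Add(link.InnerText.Trim());
        links.Add(link.GetAttributeValue("href", ""));
    }
}

Console.WriteLine(string.Join(", ", titles));
```

The null check matters in practice: iterating the result of SelectNodes directly throws a NullReferenceException on pages where the list is missing.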

How do I remove tags?

Some articles contain hidden a tags, planted to poison bulk scraping.

You can use the Descendants method to find all a tags and remove them:

// ToArray() requires "using System.Linq;" and takes a snapshot so nodes can be removed safely
var aNodes = document.DocumentNode.Descendants("a");
foreach (var aNode in aNodes.ToArray())
{
    aNode.Remove();
}
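If you only want to drop the hidden links and keep the real ones, one possible variation is to filter before removing. The sketch below assumes the links are hidden with an inline style of display:none; real sites may hide them in other ways (for example via CSS classes), so treat the filter as a placeholder.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical article fragment with one hidden link and one real link.
string html = "<p>Real text <a href='/x' style='display:none'>spam</a>and <a href='/y'>a real link</a>.</p>";

var document = new HtmlDocument();
document.LoadHtml(html);

// Assumption: hidden links carry an inline "display:none" style.
var hiddenLinks = document.DocumentNode.Descendants("a")
    .Where(a => a.GetAttributeValue("style", "").Replace(" ", "").Contains("display:none"))
    .ToArray(); // snapshot first, then modify the tree

foreach (var a in hiddenLinks)
{
    a.Remove(); // removes the tag together with its contents
}

int remainingLinks = document.DocumentNode.Descendants("a").Count();
string text = document.DocumentNode.InnerText;
Console.WriteLine(text);
```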

How do I get the images?

An image's address is usually stored in the src attribute of the img tag:

var imgNodes = detail.DocumentNode.Descendants("img");
foreach (var img in imgNodes)
{
    // Read the src attribute; fall back to an empty string when it is missing
    string imgurl = img.GetAttributeValue("src", "");
}

Once you have the address, you can use HttpClient to download the image and save it to a folder.
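One detail worth sketching: src values are often relative, so they need to be resolved against the page address before downloading. In the example below the page URL, the markup, and the helper SaveImageAsync are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

// Hypothetical page address and markup.
var pageUri = new Uri("https://example.com/blog/post-1.html");
string html = "<img src='/static/a.png'><img src='img/b.jpg'><img src='https://cdn.example.com/c.gif'><img>";

var detail = new HtmlDocument();
detail.LoadHtml(html);

var imageUris = new List<Uri>();
foreach (var img in detail.DocumentNode.Descendants("img"))
{
    string src = img.GetAttributeValue("src", "");
    if (string.IsNullOrWhiteSpace(src)) continue; // skip img tags without a src

    // Resolve relative addresses against the page address; absolute ones pass through unchanged.
    if (Uri.TryCreate(pageUri, src, out Uri absolute))
    {
        imageUris.Add(absolute);
    }
}

// Download one image and save it under its original file name.
async Task SaveImageAsync(HttpClient http, Uri imageUri, string folder)
{
    Directory.CreateDirectory(folder);
    byte[] bytes = await http.GetByteArrayAsync(imageUri);
    string fileName = Path.GetFileName(imageUri.LocalPath);
    await File.WriteAllBytesAsync(Path.Combine(folder, fileName), bytes);
}

foreach (var uri in imageUris) Console.WriteLine(uri);

// Usage (performs real requests):
// using var http = new HttpClient();
// foreach (var uri in imageUris) await SaveImageAsync(http, uri, "images");
```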

How do I modify node attributes?

For example, suppose we have uploaded an image to our own server and want to replace the original image address in the article:

var imgNodes = detail.DocumentNode.Descendants("img");
foreach (var img in imgNodes)
{
    img.SetAttributeValue("src", "image address");
}
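To show the whole round trip, here is a small sketch that rewrites each src and then serializes the modified document back to HTML. The host name img.mysite.example and the MapToOwnServer helper are placeholders; in a real project the new address would come from your own upload step.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical article fragment.
string html = "<p><img src='https://other.example/pic/1.png' alt='one'></p>";

var detail = new HtmlDocument();
detail.LoadHtml(html);

// Placeholder mapping: keep the file name, swap the host for our own server.
string MapToOwnServer(string originalSrc)
{
    string fileName = originalSrc.Substring(originalSrc.LastIndexOf('/') + 1);
    return "https://img.mysite.example/" + fileName;
}

foreach (var img in detail.DocumentNode.Descendants("img").ToArray())
{
    string oldSrc = img.GetAttributeValue("src", "");
    if (oldSrc.Length == 0) continue; // nothing to rewrite
    img.SetAttributeValue("src", MapToOwnServer(oldSrc));
}

// OuterHtml of the document node reflects the changes made to the tree.
string updatedHtml = detail.DocumentNode.OuterHtml;
string newSrc = detail.DocumentNode.SelectSingleNode("//img").GetAttributeValue("src", "");
Console.WriteLine(updatedHtml);
```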

Basically, once you have mastered these points, you can go and collect content from other people's websites.

Of course, HtmlAgilityPack can do far more than what is described in this article. You will probably only reach for the other features when you have more demanding requirements.

If anything is unclear or you would like to discuss it, you can add me on QQ: 862640563, or join the QQ group: 545594312
