We often run into simple requirements: we need to scrape a bit of data from a website, the page structure is straightforward, and the amount of data is small. We could write our own crawler, but why use a sledgehammer to crack a nut?
There are already some mature zero-code crawler tools on the market, such as Octoparse, which ship with ready-made templates and also let you define your own crawling rules. But today I want to introduce a different one: Web Scraper, a Chrome extension you can use directly inside the F12 developer tools.
1. Install the Web Scraper
If you have access to the Chrome Web Store, simply search for Web Scraper there and install it.
If you don't, you can download the CRX file from this site (crxdl.com/) and install it offline; a quick search will turn up the detailed steps.
Once installed, restart Chrome and you will find the Web Scraper tab in the F12 developer tools.
2. Basic concepts and operations
Before using Web Scraper, I need to explain a few basic concepts:
Sitemap
Literally, a "site map": with this map, the crawler can follow it to fetch the data we want.
So a sitemap can be understood as the crawler for one website; to crawl data from multiple websites you have to define multiple sitemaps.
Sitemaps can be exported and imported, which means your sitemap can be shared with other people.
As you can see from the figure below, a sitemap is just a string of JSON configuration.
Once you have that configuration, you can import someone else's sitemap.
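To give a sense of that structure, here is a minimal sketch of a sitemap. The URL and the CSS selector are made up purely for illustration, and real exports may carry a few extra fields depending on the extension version:

```json
{
  "_id": "example-blog",
  "startUrl": ["https://example.com/blog"],
  "selectors": [
    {
      "id": "title",
      "type": "SelectorText",
      "parentSelectors": ["_root"],
      "selector": "h1.post-title",
      "multiple": false,
      "regex": "",
      "delay": 0
    }
  ]
}
```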
Selector
To extract data from an HTML page full of content, we need selectors to locate exactly where our data lives.
Each selector extracts one piece of data; to extract several pieces, you need to define several selectors.
Web Scraper provides many types of selectors, but this article only introduces a few of the most frequently used and most broadly applicable ones. Once you have learned one or two, the rest work much the same way, and a little extra reading will get you up to speed quickly.
Web Scraper uses CSS selectors to locate elements. If you don't know CSS selectors, that's fine: in most scenarios you can simply click the element on the page and Web Scraper will work out the CSS path automatically.
Selectors can be nested, and a child selector's CSS is scoped to its parent selector.
It is precisely this unlimited nesting that lets us recursively crawl an entire site's data.
Below is a selector topology diagram, which shows Web Scraper's crawling logic a bit more visually.
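In the exported JSON, that nesting is expressed through parentSelectors: a child points at its parent's id, and its CSS selector is resolved inside the parent's elements. A rough sketch, with ids and CSS paths invented for illustration:

```json
{
  "_id": "nesting-example",
  "startUrl": ["https://example.com/list"],
  "selectors": [
    {
      "id": "row",
      "type": "SelectorElement",
      "parentSelectors": ["_root"],
      "selector": "ul.posts > li",
      "multiple": true,
      "delay": 0
    },
    {
      "id": "title",
      "type": "SelectorText",
      "parentSelectors": ["row"],
      "selector": "a.title",
      "multiple": false,
      "regex": "",
      "delay": 0
    }
  ]
}
```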
Data crawl and export
Once you have defined the sitemap rules, click Scrape to start crawling.
The data does not appear in the panel immediately after the crawl finishes; you need to click the refresh button manually to see it.
Finally, the data can also be exported as CSV or XLSX files.
3. Pager crawl
The classic model for crawled data is list, pagination, and detail pages, and that is the order I will follow, crawling CSDN blog posts to show how to use several selectors.
Pagers fall into two types:
- One type reloads the whole page when you click next page
- The other only re-renders part of the current page when you click next page
In earlier versions of Web Scraper, the two cases were crawled differently:
- Pages that reload require the Link selector
- Pages that do not reload use the Element Click selector
This is enough for some websites, but it has significant limitations.
From my experiments, the Link selector works by extracting the hyperlink from the next-page <a> tag and then visiting it, but not every site implements its next-page button with an <a> tag.
If the site instead listens for a click event in JavaScript and then navigates, as in the example below, the Link selector cannot be used.
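The original snippet is not reproduced here, but such a pager typically looks something like this hypothetical example: the next-page control is a plain button wired to a JavaScript click handler, so there is no href for the Link selector to extract.

```html
<!-- Hypothetical pager: no <a href> to follow, only a JS click handler -->
<button class="next-page">Next</button>
<script>
  var currentPage = 1;
  document.querySelector('.next-page').addEventListener('click', function () {
    // The jump happens inside the script, so the Link selector has nothing to grab
    location.href = '/blog/list/' + (currentPage + 1);
  });
</script>
```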
Newer versions of Web Scraper, however, add dedicated support for navigation pagers in the form of the Pagination selector, which works in both scenarios, as I will demonstrate below.
Pager crawl without reloading a page
Click on a specific CSDN blog post and scroll down to the bottom to see the comments section.
If the post is popular and has many comments, CSDN paginates them. But no matter which page of comments you are on, they all belong to the same article, and browsing any page of the comments never refreshes the post, because this kind of pagination does not reload the page.
For clicks like this that do not reload the page, you can use the Element Click selector.
One last, crucial point: select both root and next_page as parent selectors, so that the crawl can recurse through every page.
The final result of the crawl is as follows
The sitemap configuration using Element Click is shown below; you can import my configuration directly to study it (config file download: wwe.lanzoui.com/iidSSwghkch…)
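In case the download link is unavailable, here is a rough template of what an Element Click sitemap for this comment pager could look like. The URL, CSS selectors, and delay are placeholders (Web Scraper fills in the real CSS paths when you click the elements), the field set may differ slightly between extension versions, and the next_page selector lists itself as a parent to get the recursion described above:

```json
{
  "_id": "csdn-comments-element-click",
  "startUrl": ["https://blog.csdn.net/<user>/article/details/<post-id>"],
  "selectors": [
    {
      "id": "next_page",
      "type": "SelectorElementClick",
      "parentSelectors": ["_root", "next_page"],
      "selector": "div.comment-list",
      "clickElementSelector": "a.next-page-btn",
      "clickType": "clickOnce",
      "multiple": true,
      "delay": 2000
    },
    {
      "id": "comment",
      "type": "SelectorText",
      "parentSelectors": ["next_page"],
      "selector": "div.comment-content",
      "multiple": true,
      "regex": "",
      "delay": 0
    }
  ]
}
```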
Of course, for pagination like this, Web Scraper provides the more specialized Pagination selector, which is simpler to configure and works better.
The corresponding sitemap configuration is shown below and can be imported directly (configuration file download: wwe.lanzoui.com/iidSSwghkch…)
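Again as a hedged template rather than a verified export: with the Pagination selector the sitemap collapses to a pager selector plus the comment selector. The type string, parent wiring, and CSS selectors here are my best guess at the format, so compare against a sitemap exported from your own browser:

```json
{
  "_id": "csdn-comments-pagination",
  "startUrl": ["https://blog.csdn.net/<user>/article/details/<post-id>"],
  "selectors": [
    {
      "id": "pagination",
      "type": "SelectorPagination",
      "parentSelectors": ["_root", "pagination"],
      "selector": "div.comment-box a.next",
      "delay": 0
    },
    {
      "id": "comment",
      "type": "SelectorText",
      "parentSelectors": ["_root", "pagination"],
      "selector": "div.comment-content",
      "multiple": true,
      "regex": "",
      "delay": 0
    }
  ]
}
```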
Pager crawl that reloads the page
On CSDN's blog post list, scroll to the bottom and click a specific page number, or the next-page button on the far right, and the current page reloads.
With this kind of pager, Element Click does not work; you can verify for yourself that it stops after crawling only one page.
The Pagination selector, built precisely for pagers, is naturally the right tool here.
The crawl topology is the same as above, so I will not repeat it here.
The corresponding sitemap configuration is shown below; you can import it directly to study it (configuration file download: wwe.lanzoui.com/iidSSwghkch…)
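A sketch of what that configuration might look like, with placeholder CSS selectors and type strings that should be checked against a real export; the shape is the same as the comment example above, only the start URL points at the blog list and the data selectors describe each post row:

```json
{
  "_id": "csdn-blog-list",
  "startUrl": ["https://blog.csdn.net/<your-username>"],
  "selectors": [
    {
      "id": "pagination",
      "type": "SelectorPagination",
      "parentSelectors": ["_root", "pagination"],
      "selector": ".pagination a",
      "delay": 0
    },
    {
      "id": "post",
      "type": "SelectorElement",
      "parentSelectors": ["_root", "pagination"],
      "selector": "div.article-item",
      "multiple": true,
      "delay": 0
    },
    {
      "id": "title",
      "type": "SelectorText",
      "parentSelectors": ["post"],
      "selector": "h4 a",
      "multiple": false,
      "regex": "",
      "delay": 0
    },
    {
      "id": "views",
      "type": "SelectorText",
      "parentSelectors": ["post"],
      "selector": "span.view-count",
      "multiple": false,
      "regex": "\\d+",
      "delay": 0
    }
  ]
}
```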
4. Crawling secondary pages
CSDN's blog list page only shows summary information: title, publication time, view count, comment count, and whether the post is original.
To get more details about a post, such as its body, like count, favorite count, and the comments themselves, you have to click the link and open the post itself.
Web Scraper's logic mirrors a human's: to pull more details out of a post, a new page has to be opened, and Web Scraper's Link selector does exactly that.
The crawl path topology is as follows
The effect of the crawl is as follows
The sitemap configuration is shown below and can be imported directly (configuration file download: wwe.lanzoui.com/iidSSwghkch…)
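As before, a hedged sketch rather than the author's exact file: the Link selector (post_link) opens each post, and the detail fields are defined as its children. All CSS selectors below are placeholders you would replace by clicking the real elements in the tool:

```json
{
  "_id": "csdn-blog-detail",
  "startUrl": ["https://blog.csdn.net/<your-username>"],
  "selectors": [
    {
      "id": "pagination",
      "type": "SelectorPagination",
      "parentSelectors": ["_root", "pagination"],
      "selector": ".pagination a",
      "delay": 0
    },
    {
      "id": "post_link",
      "type": "SelectorLink",
      "parentSelectors": ["_root", "pagination"],
      "selector": "div.article-item h4 a",
      "multiple": true,
      "delay": 0
    },
    {
      "id": "content",
      "type": "SelectorText",
      "parentSelectors": ["post_link"],
      "selector": "div.post-body",
      "multiple": false,
      "regex": "",
      "delay": 0
    },
    {
      "id": "likes",
      "type": "SelectorText",
      "parentSelectors": ["post_link"],
      "selector": "span.like-count",
      "multiple": false,
      "regex": "\\d+",
      "delay": 0
    }
  ]
}
```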
5. Final words
The sections above walked through the two main crawling scenarios: paginated lists and secondary (detail) pages.
With these two, you should be able to handle most structured web data.
For example, you can crawl all the information of your blog posts on CSDN, including title, link, article content, number of views, number of comments, number of likes and number of favorites.
Of course, even though Web Scraper is a zero-code scraper, using it well still requires some basics, such as:
- CSS selector knowledge: how do you grab an element's attribute, the nth element, or a fixed number of elements? (a couple of examples follow this list)
- Regular expression knowledge: how do you do a first pass of cleanup on the crawled content?
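For the CSS part, a few patterns cover most cases (the class names here are invented):

```css
/* the 3rd item in a list */
ul.post-list li:nth-of-type(3) { }

/* the first 10 items */
ul.post-list li:nth-of-type(-n+10) { }

/* links whose href contains a given path fragment */
a[href*="/article/details/"] { }
```

For the regex part, every Text selector has a regex field; a pattern as simple as \d+ is enough to pull the bare number out of a string like "1024 views".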
For reasons of space, I have tried to cover only the core of Web Scraper; the remaining basics are left for you to explore.