Web scraping is an interesting topic, and many students are learning it. In fact, learning to scrape has a real cost: you need to handle both static and dynamic web pages, there is a pile of libraries to master, complex jobs call for the Scrapy framework or Selenium, and you may even have to work around anti-scraping measures. But if you only need data occasionally rather than crawling regularly, there is a handy tool that scrapes data very quickly and works very well. Today we introduce Web Scraper.
### Install Web Scraper
Web Scraper is a Google Chrome extension, so installing it costs very little compared to other third-party data collectors. Search for Web Scraper in the Chrome Web Store and install it from there.
Once installed, you'll see a small spider-web icon in your Chrome toolbar.
### Scraping Python repositories on GitHub
GitHub is famous for its wealth of interesting libraries, with hundreds of thousands of repos. There are many ways to scrape it, but with Web Scraper the job is very easy and takes only a few minutes. By contrast, writing scraper code by hand would take even a skilled developer half an hour, and the code would still need debugging. You can see that this tool really is convenient and the cost is very low. Let's walk through it step by step:
##### 1. Target website analysis
GitHub's site is simple and doesn't have any aggressive anti-scraping policy. Searching for Python on the home page takes you to the Python-related results page.
##### 2. URL analysis
There are a number of result categories, such as Code; we select one of them to view.
Once you know the URL pattern above, all you need to do is change the page parameter for the other pages; for example, page 2 is github.com/search?p=2&… .
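To make the pagination rule concrete, here is a minimal sketch that builds the list of page URLs by varying the `p` parameter. The full query string is truncated in the article, so the `q=python&type=Repositories` portion is an assumption used purely for illustration:

```python
# Sketch: generate paginated GitHub search URLs by varying the p parameter.
# The exact query string is truncated above; "q=python&type=Repositories"
# is an assumed example, not taken from the article.
BASE = "https://github.com/search?q=python&type=Repositories&p={page}"

def page_urls(first=1, last=100):
    """Return the search-result URLs for pages first..last."""
    return [BASE.format(page=p) for p in range(first, last + 1)]

urls = page_urls(1, 3)
print(urls[1])  # the second page, with p=2
```

In Web Scraper you don't write this loop yourself; you express the same idea by putting a page range such as `p=[1-100]` in the sitemap's start URL.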
##### 3. Create a sitemap
Right-click the page and choose Inspect to open the developer tools. A Web Scraper tab appears at the end of the panel; open it and choose Create new sitemap.
GitHub's Python repositories are laid out as one long list, and Web Scraper supports many selector types for different page elements, such as Text, Link, Image, Element, and so on.
1). Add a selector: open our sitemap (github) and start adding a selector
2). Create the item selector
3). Select the title, time, and number of stars inside each item
4). Create a title selector under item
The process is much like creating the item selector: choose Text as the Type, click Select and pick the title in the orange box above, then click Done selecting and save the selector. It will come in handy later.
5). Similarly, create selectors for the library description, star count, and time elements.
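The point-and-click steps above produce a sitemap that Web Scraper can export as JSON. The sketch below shows roughly what that export looks like for this walkthrough; the CSS selectors (`.repo-list-item` and friends) and the query string are assumptions for illustration, since in practice you pick them with the selection tool rather than typing them:

```python
import json

# Sketch of the sitemap built above, in Web Scraper's JSON export format.
# The CSS selectors and query string below are assumed examples, not values
# from the article -- in the extension they are chosen by point-and-click.
sitemap = {
    "_id": "github",
    "startUrl": ["https://github.com/search?q=python&type=Repositories&p=1"],
    "selectors": [
        {   # one Element selector wrapping each repository entry in the list
            "id": "item", "type": "SelectorElement",
            "parentSelectors": ["_root"],
            "selector": ".repo-list-item", "multiple": True,
        },
        {   # Text selectors nested under item pull out the individual fields
            "id": "title", "type": "SelectorText",
            "parentSelectors": ["item"],
            "selector": "a.v-align-middle", "multiple": False,
        },
        {
            "id": "description", "type": "SelectorText",
            "parentSelectors": ["item"],
            "selector": "p.mb-1", "multiple": False,
        },
    ],
}

print(json.dumps(sitemap, indent=2)[:80])
```

Exporting and re-importing a sitemap like this is also how you share a scraping recipe with someone else, without them repeating the clicks.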
##### 5. Start scraping
With everything set up, we can start the enjoyable part: just click Scrape under the Sitemap menu. A dialog pops up with a Request interval (ms) of 2000 and a Page load delay (ms) of 500; we keep the default parameters.
Web Scraper starts at page 100 and works its way back to page 1. You can pour a cup of tea and wait for the result; covering all 100 pages takes only a few minutes.
The results are held in the browser, and we need to save them to a file. Web Scraper does that for us too: click Export data as CSV under Sitemap, and it automatically generates a CSV file for the github sitemap.
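Once you have the CSV, ordinary tooling takes over. Here is a small sketch that loads such an export and sorts repositories by star count; the column names (`title`, `description`, `stars`) are assumptions, since Web Scraper names columns after your selector ids:

```python
import csv
import io

# Sketch: load a Web Scraper CSV export and rank repos by stars.
# Column names are assumed -- they follow your selector ids, so adjust
# them to match your own sitemap. Inline sample data stands in for the file.
sample = io.StringIO(
    "title,description,stars\n"
    "requests,HTTP for Humans,50000\n"
    "flask,Web microframework,60000\n"
)

rows = list(csv.DictReader(sample))
rows.sort(key=lambda r: int(r["stars"]), reverse=True)
print([r["title"] for r in rows])  # most-starred repos first
```

For a real export, replace the `StringIO` sample with `open("github.csv", newline="", encoding="utf-8")`.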
### Conclusion
By constructing a series of URLs you can scrape tens of thousands of repositories. For perhaps 80% of the pages you'll encounter, Web Scraper is convenient and simple: not a single line of code, and done in minutes!