Crawlers are widely used in both work and daily life: they help with preparing data for papers, doing market research, building work tools, and more. Crawling mostly targets either web pages or apps, so this article introduces the two categories separately.
Crawler outline:
1. Web page crawling
Web pages fall into two categories:
- Server-side rendering
- Client-side rendering
2. App crawling
For apps, the interface forms fall into four categories:
- Ordinary (unencrypted) interfaces
- Interfaces with encrypted parameters
- Interfaces with encrypted content
- Interfaces using unconventional protocols
One, Web page crawling
Server-side rendering means the page is rendered by the server and returned as-is, so the useful information is already contained in the requested HTML, as on the Maoyan Movie site. Client-side rendering means the main content of the page is rendered by JavaScript and the real data is fetched through Ajax interfaces, as on Taobao, the Zuiyou mobile site, and other similar sites.
Server-side rendering is relatively simple to handle: basic HTTP request libraries are enough for crawling, such as urllib, urllib3, pycurl, hyper, requests, and Grab, of which requests is probably the most widely used.
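For the server-rendered case, a minimal sketch with requests might look like the following; the URL and headers are placeholders, not taken from any specific site:

```python
import requests

# Minimal sketch: fetch a server-rendered page; the URL is a placeholder.
url = 'https://example.com/films'
headers = {'User-Agent': 'Mozilla/5.0'}  # a browser-like UA helps avoid basic blocking

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
html = response.text  # the useful data is already inside this HTML
print(html[:200])
```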
For client rendering, there are four approaches:
- Look for the Ajax interface. In this case, you can use the Chrome/Firefox developer tools to inspect the Ajax request method, parameters, and other details, and then simulate the request with an HTTP request library. You can also set up a packet-capture proxy such as Fiddler or Charles to inspect the interface.
- Simulate browser execution. This suits pages whose interface and logic are complex, crawling directly what is visible. Tools include Selenium, Splinter, Spynner, Pyppeteer, PhantomJS, Splash, requests-html, etc. (see the sketch after this list).
- Extract JavaScript data directly. In some cases the real data is not fetched through an Ajax interface but is embedded directly in a variable of the HTML result, and can be extracted with regular expressions.
- Simulate JavaScript execution. In some cases, driving a full browser is inefficient. If we figure out the site's JavaScript execution and encryption logic, we can execute the relevant JavaScript directly to complete the logic processing and interface requests, using libraries such as Selenium, PyExecJS, PyV8, and Js2Py.
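As a sketch of the "simulate browser execution" approach, headless Chrome via Selenium can return the HTML after JavaScript has run. This assumes Chrome is installed and a recent Selenium release that can locate the driver; the URL is a placeholder:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Headless Chrome sketch; assumes Chrome and a matching driver are available.
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://example.com')   # placeholder URL
    html = driver.page_source           # HTML after JavaScript rendering
    print(html[:200])
finally:
    driver.quit()
```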
Two, App crawling
For App crawling, there are four cases to handle:
- For ordinary, unencrypted interfaces, simply capture packets to see the specific request format of the interface. Usable packet-capture tools include Charles, Fiddler, and mitmproxy (see the sketch after this list).
- For interfaces with encrypted parameters, one approach is real-time processing with tools such as Fiddler, mitmdump, and Xposed; another is to crack the encryption logic and construct the requests directly, which may require some decompilation skills.
- For interfaces with encrypted content, i.e. where the returned result is completely unintelligible, you can use a visual crawling tool such as Appium, use an Xposed hook to obtain the rendered result, or crack it by decompiling the app or modifying the phone's underlying system.
- For unconventional protocols, you can use Wireshark to capture packets of all protocols, or tcpdump to capture TCP packets.
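For the packet-capture case, a small mitmproxy addon script run with mitmdump can log the requests and responses of the interface you care about. A minimal sketch, in which the `/api/feed` path is a hypothetical placeholder:

```python
# capture.py - run with: mitmdump -s capture.py
# Sketch of a mitmproxy addon that logs responses from a hypothetical API path.
from mitmproxy import http


def response(flow: http.HTTPFlow) -> None:
    # '/api/feed' is a placeholder for the App interface you are interested in
    if '/api/feed' in flow.request.pretty_url:
        print(flow.request.pretty_url)
        print(flow.response.text[:200])
```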
Three, Parsing
For HTML pages, there are several common parsing methods, such as regular expressions (re), XPath, and CSS selectors. For interfaces, the results are typically JSON or XML, which can be handled with the corresponding libraries.
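A quick sketch of XPath and CSS-selector parsing using the parsel library (the HTML fragment is made up for illustration):

```python
from parsel import Selector

# Sketch: parse the same fragment with XPath and a CSS selector.
html = '<div class="movie"><h2>Title</h2><span class="score">9.1</span></div>'
sel = Selector(text=html)

title = sel.xpath('//div[@class="movie"]/h2/text()').get()
score = sel.css('span.score::text').get()
print(title, score)
```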
Writing these rules and parsing routines by hand is tedious. If we need to crawl tens of thousands of websites, we can use intelligent parsing instead: given a page, an algorithm automatically extracts its title, body text, date, and other content while stripping out useless information.
Intelligent parsing can be divided into four methods:
- The Readability algorithm, which defines scoring rules for different blocks and calculates weights to find the most likely block locations (see the sketch after this list).
- Text-density judgment: compute the average length of text per node within a block, and roughly distinguish content from noise by density.
- Scrapely self-learning: a component developed by the Scrapy team; given a sample page and the expected extraction result, it learns extraction rules and applies them to other pages with a similar layout.
- Deep learning: perform supervised learning over the parsed positions, which requires large amounts of annotated data.
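A Readability-style extraction can be tried with the readability-lxml package; this is only a sketch, and the URL is a placeholder:

```python
import requests
from readability import Document  # pip install readability-lxml

# Sketch of Readability-style extraction; the URL is a placeholder.
html = requests.get('https://example.com/article', timeout=10).text
doc = Document(html)
print(doc.title())          # extracted title
print(doc.summary()[:200])  # HTML of the most likely body block
```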
Four, Storage
Storage means choosing an appropriate storage medium for the crawled results, divided here into four methods:
- Files, such as JSON, CSV, TXT, images, video, audio, etc. Commonly used libraries include csv, xlwt, json, pandas, pickle, and python-docx (see the sketch after this list).
- Databases, which fall into relational and non-relational databases, such as MySQL, MongoDB, HBase, etc. Common libraries include PyMySQL, pymssql, redis-py, PyMongo, py2neo, and thrift.
- Search engines, such as Solr and Elasticsearch, which make searching and text matching easy. Common libraries include elasticsearch and pysolr.
- Cloud storage: some media files can be stored on Qiniu Cloud, Upyun, Alibaba Cloud, Tencent Cloud, Amazon S3, etc. Common libraries include qiniu, upyun, boto, azure-storage, google-cloud-storage, etc.
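A small sketch of the first two options, saving the same items to a CSV file and to MongoDB; it assumes a local MongoDB instance on the default port, and the data and names are placeholders:

```python
import csv
from pymongo import MongoClient

items = [{'title': 'Example', 'score': 9.1}]  # placeholder crawl results

# Save to a CSV file with the standard library.
with open('movies.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'score'])
    writer.writeheader()
    writer.writerows(items)

# Save to MongoDB; assumes a local instance on the default port.
client = MongoClient('localhost', 27017)
client['spider']['movies'].insert_many(items)
```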
The key here is to connect with the actual business and choose whichever approach best meets its needs.
Five, Handling anti-crawling measures
Anti-crawling is a key part: crawling is getting harder and harder, and many websites have added a variety of anti-crawling measures, which can be grouped into non-browser detection, IP blocking, verification codes, account blocking, font-based anti-crawling, and so on. The following explains how to handle three of them: IP blocking, verification codes, and account blocking.
1. Prevent IP address blocking
When an IP gets blocked, there are several options:
- First look for a mobile or App version of the site; if one exists, its anti-crawling measures are usually weaker.
- Use proxies, such as scraping free proxies, buying paid proxies, or using Tor or SOCKS proxies.
- On top of the proxies, maintain your own proxy pool to avoid wasting proxies and to ensure real-time availability (see the sketch after this list).
- Set up an ADSL dial-up proxy, which is stable and efficient.
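As a sketch of using a small proxy pool with requests, where the proxy addresses are placeholders:

```python
import random
import requests

# Sketch: pick a proxy at random from a small pool; the addresses are placeholders.
PROXY_POOL = [
    'http://127.0.0.1:7890',
    'http://127.0.0.1:7891',
]

proxy = random.choice(PROXY_POOL)
proxies = {'http': proxy, 'https': proxy}
resp = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
print(resp.text)
```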
2. Verification codes
There are many kinds of verification codes, such as the common graphic code, arithmetic code, slider code, tap code, SMS code, QR-code scanning, etc.
- For ordinary graphic verification codes that are regular and free of distortion or interference, you can use OCR, or train a model with machine learning or deep learning; of course, a code-solving platform is the most convenient option (see the sketch after this list).
- For arithmetic verification codes, it is recommended to use a code-solving platform directly.
- For slider verification codes, you can crack the checking algorithm or simulate the slide. The key for the latter is to locate the gap, which can be done by image comparison, by writing a basic image-recognition algorithm, by using a code-solving platform, or by training a recognition model with deep learning.
- For tap verification codes, you are advised to use a code-solving platform.
- For mobile (SMS) verification codes, you can use a code-distribution platform, buy dedicated code-receiving devices, or verify manually.
- For QR-code scanning, you can scan the code manually or use a code-solving platform.
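For the simplest case, an undistorted graphic code, a sketch of OCR with pytesseract; it assumes the Tesseract binary is installed, the image file name and binarization threshold are placeholders:

```python
from PIL import Image
import pytesseract  # requires the Tesseract binary to be installed

# Sketch: OCR a simple, undistorted graphic verification code.
image = Image.open('captcha.png').convert('L')      # grayscale to reduce noise
# Binarize: keep dark pixels as text, drop the light background (threshold is a guess).
image = image.point(lambda p: 0 if p < 140 else 255)
print(pytesseract.image_to_string(image).strip())
```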
3. Avoid account blocking
Some websites require login before they can be accessed, but if an account makes requests too frequently after logging in, it will be blocked. To avoid this, you can take the following measures:
- Look for a mobile or App version of the site; these are usually interface-based and have weaker verification.
- Look for interfaces that do not require login; if possible, find ones that can be crawled without logging in.
- Maintain a Cookies pool: log in with a batch of accounts and randomly pick available Cookies for each request (see the sketch after this list). A reference implementation: github.com/Python3WebS…
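A minimal sketch of the Cookies-pool idea, with an in-memory pool and placeholder cookie values; in practice the pool would live in Redis and be refreshed by a separate login module:

```python
import random
import requests

# Sketch of a minimal Cookies pool: pick a random logged-in session per request.
COOKIES_POOL = [
    {'sessionid': 'placeholder-account-1'},
    {'sessionid': 'placeholder-account-2'},
]

cookies = random.choice(COOKIES_POOL)
resp = requests.get('https://example.com/api/data', cookies=cookies, timeout=10)
print(resp.status_code)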
Six, Speeding up crawling
When the amount of data to crawl is very large, the key is how to grab it efficiently and quickly. Common measures include multithreading, multiprocessing, asynchronous requests, distribution, and detail-level optimization.
1. Multithreading and multiprocessing
Crawling is a network-request-intensive task, so using multiple processes and threads can greatly improve efficiency, for example with the threading and multiprocessing modules.
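A sketch of concurrent fetching with a thread pool from the standard library; the URLs are placeholders:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

# Sketch: fetch many pages concurrently with a thread pool; URLs are placeholders.
urls = [f'https://example.com/page/{i}' for i in range(1, 11)]

def fetch(url):
    return url, requests.get(url, timeout=10).status_code

with ThreadPoolExecutor(max_workers=5) as pool:
    for url, status in pool.map(fetch, urls):
        print(url, status)
```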
2. Asynchronous requests
Make the crawling process non-blocking: handle responses as they arrive, and let other tasks run during the waiting time. Examples include asyncio, aiohttp, Tornado, Twisted, gevent, grequests, Pyppeteer, pyspider, and Scrapy.
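A sketch of the asynchronous approach with asyncio and aiohttp; the URLs are placeholders:

```python
import asyncio
import aiohttp

# Sketch: fetch pages asynchronously with aiohttp; URLs are placeholders.
urls = [f'https://example.com/page/{i}' for i in range(1, 11)]

async def fetch(session, url):
    async with session.get(url) as resp:
        return url, resp.status

async def main():
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch(session, u) for u in urls))
        for url, status in results:
            print(url, status)

asyncio.run(main())
```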
3. Distributed crawling
The key to distribution is sharing the task queue, using queue systems such as Celery, Huey, RQ, RabbitMQ, and Kafka, or ready-made frameworks such as pyspider, scrapy-redis, and scrapy-cluster.
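With scrapy-redis, for example, sharing the queue mostly comes down to a few settings in the Scrapy project's settings.py; this is a sketch, and the Redis address is a placeholder:

```python
# settings.py fragment - sketch of sharing the task queue via scrapy-redis;
# the Redis address is a placeholder.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # queue requests in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared de-duplication
SCHEDULER_PERSIST = True                                    # keep the queue between runs
REDIS_URL = "redis://127.0.0.1:6379"
```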
4. Optimization
Certain optimization measures can be taken to speed up crawling, such as:
- DNS cache
- Use faster parsing methods
- Use a more efficient deduplication method (see the sketch after this list)
- Separate the crawler's modules and control them independently
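A minimal sketch of URL deduplication with an in-memory fingerprint set; for very large crawls a Redis set or a Bloom filter is the usual replacement:

```python
import hashlib

# Sketch: de-duplicate URLs by hashing them into a set kept in memory.
seen = set()

def is_new(url):
    fingerprint = hashlib.md5(url.encode('utf-8')).hexdigest()
    if fingerprint in seen:
        return False
    seen.add(fingerprint)
    return True

print(is_new('https://example.com/page/1'))  # True
print(is_new('https://example.com/page/1'))  # False
```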
5. Architecture
If a distributed system has been set up, we can use two architectures to maintain the crawler project and to manage scheduling and monitoring efficiently:
- Package the Scrapy project as a Docker image and use Kubernetes to control scheduling.
- Deploy Scrapy projects to Scrapyd and manage them with dedicated management tools such as SpiderKeeper and Gerapy (scheduling through Scrapyd's HTTP API is sketched below).
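As a sketch of the Scrapyd route, a spider can be scheduled through Scrapyd's HTTP JSON API; this assumes Scrapyd is running locally on its default port, and the project and spider names are placeholders:

```python
import requests

# Sketch: schedule a spider on a Scrapyd server through its HTTP JSON API.
resp = requests.post(
    'http://127.0.0.1:6800/schedule.json',
    data={'project': 'myproject', 'spider': 'myspider'},
    timeout=10,
)
print(resp.json())  # e.g. {'status': 'ok', 'jobid': '...'}
```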
That's the end of this overview. For more Python content, follow me.