Question 1: How do you handle dynamically loaded data and strict timeliness requirements?

How do I know whether a website loads its data dynamically? Open the page in Firefox or Chrome, right-click to view the page source, and press Ctrl+F to search for content that is visible on the rendered page. If the value does not appear in the source, the data is loaded dynamically.

  1. Selenium + PhantomJS
  2. Use WebDriverWait instead of sleep, so the script continues as soon as the target element is loaded instead of always waiting a fixed time (see the sketch below)
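
A minimal sketch of option 2, assuming Selenium 4. Since PhantomJS has been deprecated and removed from recent Selenium releases, headless Chrome stands in for it here; the URL and CSS selector are placeholders.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless")          # PhantomJS is deprecated; headless Chrome fills the same role
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.com/page")  # placeholder URL
    # Block only until the dynamically loaded element appears (max 10 s),
    # instead of a fixed sleep() that always waits the full duration.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".dynamic-content"))  # placeholder selector
    )
    print(element.text)
finally:
    driver.quit()
```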

2. What are the common frameworks for Python crawlers?

| No. | Framework | Description | Website |
| --- | --- | --- | --- |
| 1 | Scrapy | Scrapy is an application framework designed to crawl websites and extract structured data. It can be used in a range of applications including data mining, information processing, and storing historical data, and it makes it easy to scrape data such as Amazon product information. | scrapy.org/ |
| 2 | PySpider | PySpider is a powerful web crawler system written in Python. It lets you write scripts in a browser interface, schedule jobs, and view crawl results in real time; results can be stored in common databases on the back end, and tasks and task priorities can be scheduled. | Github.com/binux/pyspi… |
| 3 | Crawley | Crawley crawls website content at high speed, supports relational and non-relational databases, and can export data to JSON and XML. | project.crawley-cloud.com/ |
| 4 | Portia | Portia is an open-source visual crawler that lets you crawl websites without any programming knowledge. Simply annotate the pages you are interested in, and Portia creates a spider to extract data from similar pages. | Github.com/scrapinghub… |
| 5 | Newspaper | Newspaper extracts news and articles and performs content analysis. It uses multithreading and supports more than 10 languages. | Github.com/codelucas/n… |
| 6 | Beautiful Soup | Beautiful Soup is a Python library that extracts data from HTML or XML files. It lets you navigate, search, and modify documents through your favorite parser and can save hours or even days of work (a short sketch follows the table). | www.crummy.com/software/Be… |
| 7 | Grab | Grab is a Python framework for building web scrapers. With Grab you can build anything from simple 5-line scripts to complex asynchronous scrapers that handle millions of pages. It provides an API for performing network requests and processing the received content, such as interacting with the DOM tree of an HTML document. | Docs.grablib.org/en/latest/#… |
| 8 | Cola | Cola is a distributed crawler framework that lets users write a few specific functions without worrying about the details of distributed execution. Tasks are automatically distributed across multiple machines and the process is transparent to the user. | (no official site found) |

There are many more; keep building your own list and search for the rest.
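
As a quick illustration of the lighter-weight end of this table, here is a minimal sketch of Beautiful Soup used together with Requests. The URL and the `h2.title` selector are assumptions for illustration only.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; a real crawl would add headers, error handling, and rate limiting.
resp = requests.get("https://example.com/articles", timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")

# Extract every article title, assuming titles live in <h2 class="title"> tags.
for h2 in soup.select("h2.title"):
    print(h2.get_text(strip=True))
```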

3. What are the advantages and disadvantages of Scrapy?

Pros: Scrapy is asynchronous.

It uses the more readable XPath instead of regular expressions, has a powerful statistics and logging system, and can crawl different URLs at the same time. It supports a shell mode that makes independent debugging convenient, and middleware makes it easy to write unified filters, with extracted items written to a database through pipelines (a pipeline sketch follows).
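
A minimal sketch of the "unified filter piped into a database" idea: a Scrapy item pipeline that drops incomplete items and stores the rest in SQLite. The item fields (`title`, `url`), the table name, and the database file are assumptions; a real project would also register the class in the ITEM_PIPELINES setting.

```python
import sqlite3
from scrapy.exceptions import DropItem

class CleanAndStorePipeline:
    """Drop items without a title, then persist the rest (illustrative fields)."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect("items.db")
        self.conn.execute("CREATE TABLE IF NOT EXISTS items (title TEXT, url TEXT)")

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        if not item.get("title"):
            raise DropItem("missing title")        # the "unified filter" part
        self.conn.execute("INSERT INTO items VALUES (?, ?)",
                          (item["title"], item.get("url")))
        return item
```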

Cons: it is a Python-based crawler framework, so extensibility is limited.

Because Scrapy is built on Twisted, an exception raised while a spider is running does not kill the reactor, and the asynchronous framework does not stop the other tasks after an error, so data errors can be hard to detect.

4. How do Scrapy and Requests compare?

  • Scrapy is an encapsulated framework that comes with a downloader, parsers, logging, and exception handling. It is built on Twisted and handles requests asynchronously. It is well suited to developing a crawler for a fixed, single site, but for multi-site crawling, concurrency, and distributed processing it is less flexible and harder to adjust and extend.

  • Requests is an HTTP library used only to make requests. For HTTP requests it is powerful, while downloading, parsing, scheduling, and storage are all handled by your own code. That gives high flexibility, makes high-concurrency and distributed deployments easy to arrange, and lets features be implemented exactly as needed (see the sketch below).
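
A hedged sketch of the Requests side of that comparison: the library only performs the HTTP call, and everything around it is left to the caller. The URL, User-Agent string, and the assumption that the endpoint returns JSON are placeholders.

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})  # placeholder UA string

# Requests only handles the HTTP part; retries, parsing, scheduling,
# and persistence are all up to the surrounding code.
resp = session.get("https://example.com/api/items", timeout=10)  # placeholder URL
resp.raise_for_status()
data = resp.json()          # assumes the endpoint returns JSON
print(len(data))
```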

5. Describe how the Scrapy framework works.

  1. The spider takes the first batch of URLs from start_urls and sends requests. The engine hands each request to the scheduler, which places it in the request queue. When the scheduler releases a request, the engine passes it to the downloader, which fetches the corresponding response. The response goes back to the spider's own parse method for extraction, and the extracted items are handed to the pipeline for processing.
  2. If new URLs are extracted, the previous steps repeat (the engine sends the URL request to the scheduler to be queued…) until there are no requests left in the request queue and the program ends. A minimal spider illustrating this loop is sketched below.
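
A minimal sketch of such a spider, assuming the public practice site quotes.toscrape.com and its CSS classes: parse() yields items (handed to the pipelines) and follow-up requests (sent back to the scheduler), repeating the cycle until the queue is empty.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]   # first batch of URLs

    def parse(self, response):
        # Extracted data is yielded as items and handed to the pipelines.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Newly extracted URLs are yielded as requests and go back to the scheduler.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```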

Question 6: What are the ways to implement simulated login?

  • Use a cookie from a session that is already logged in: send it in the request headers, and you can send a GET request directly to pages that are only accessible after login.
  • First send a GET request to the login page and extract any data the login form requires (if necessary) from its HTML, then send a POST request with the account and password. Once the login succeeds, keep using the returned cookie information to visit subsequent pages (see the sketch below).
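
A hedged sketch of the second approach using a requests.Session: GET the login page, pull out a hidden token if the form needs one, POST the credentials, then reuse the same session (which now carries the login cookies) for protected pages. The URLs, credentials, and the csrf_token field name are assumptions.

```python
import requests
from bs4 import BeautifulSoup

session = requests.Session()

# 1. GET the login page and extract data the form requires (field name is an assumption).
login_page = session.get("https://example.com/login", timeout=10)
soup = BeautifulSoup(login_page.text, "html.parser")
token_input = soup.find("input", {"name": "csrf_token"})
token = token_input["value"] if token_input else ""

# 2. POST the credentials together with the token.
session.post("https://example.com/login", data={
    "username": "user",        # placeholder credentials
    "password": "secret",
    "csrf_token": token,
}, timeout=10)

# 3. The session now carries the login cookies, so protected pages can be fetched directly.
profile = session.get("https://example.com/profile", timeout=10)
print(profile.status_code)
```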

Question 7: What anti-crawler strategy have you encountered?

  1. IP bans
  2. User-Agent bans
  3. Cookie bans
  4. CAPTCHA verification
  5. JavaScript rendering
  6. Ajax asynchronous loading
  7. And so on… (countering the first two is sketched after this list)
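
For the first two items, a common counter-measure is rotating proxies and User-Agent headers on each request. A minimal sketch with Requests follows; the proxy addresses, User-Agent strings, and URL are placeholders.

```python
import random
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",    # placeholder UA strings
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15)",
]
PROXIES = [
    "http://127.0.0.1:8001",                        # placeholder proxy addresses
    "http://127.0.0.1:8002",
]

def fetch(url):
    """Rotate the User-Agent and proxy on every request to spread out the footprint."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    proxy = random.choice(PROXIES)
    return requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy},
                        timeout=10)

resp = fetch("https://example.com")                 # placeholder URL
print(resp.status_code)
```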

Follow-up Interview Questions

  • What’s your favorite anti-crawler solution?
  • Have you used multithreading and asynchrony? What other methods have you used to improve crawler efficiency?
  • Have you ever done incremental fetching?
  • What do you know about Python crawler frameworks?

At the age of 27 she learned C, C++, and Python from scratch; at 29 she had written 100 tutorials; at 30 she had mastered 10 programming languages.

Welcome to follow her WeChat official account, non-undergraduate programmer.