The text and pictures in this article come from the internet and are for learning and exchange only; they are not used for any commercial purpose. Copyright belongs to the original author. If you have any questions, please contact us.

The following article is from the Tencent Cloud author: Learning Python


A summary of Python crawler interview questions

1. Write a regular expression for an email address?

[A-Za-z0-9\u4e00-\u9fa5]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$
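
A minimal sketch of how this pattern might be used in Python; the anchored pattern and the helper name is_email are illustrative assumptions:

import re

# Anchored version of the pattern above; \u4e00-\u9fa5 additionally allows Chinese characters in the local part
EMAIL_RE = re.compile(r'^[A-Za-z0-9\u4e00-\u9fa5]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$')

def is_email(text):
    # True only when the whole string looks like an email address
    return EMAIL_RE.match(text) is not None

print(is_email('user01@example.com'))   # True
print(is_email('not-an-email'))         # False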

2. Tell us about Selenium and PhantomJS

Selenium is a web automation testing tool. According to our instructions, it can make the browser automatically load pages, retrieve the data we need, take screenshots of pages, or check whether certain actions have taken place on a website. Selenium does not come with a browser of its own and does not provide browser functionality; it needs to be used together with a third-party browser. But sometimes we need it to run embedded in code, so we can use a tool called PhantomJS in place of a real browser. Selenium has an API called WebDriver. WebDriver is a bit like a browser that can load websites, but it can also be used like BeautifulSoup or any other Selector object to find page elements, interact with them (send text, click, etc.), and perform the other actions needed to run a web crawler.

PhantomJS is a WebKit-based "headless" browser: it loads a website into memory and executes the JavaScript on the page. Because it does not display a graphical interface, it runs more efficiently and uses fewer resources than a full browser such as Chrome or Firefox.

If we combine Selenium with PhantomJS, we can run a very powerful web crawler that handles JavaScript, cookies, headers, and anything else a real user would do. Selenium does not guarantee that PhantomJS exits successfully after the main program exits; it is best to shut down the PhantomJS process manually (otherwise multiple PhantomJS processes may be left running and consume memory). WebDriverWait may reduce latency, but it is currently buggy, in which case sleep can be used instead. PhantomJS is slow at crawling data, so multithreading can help. If some pages work and some don't, try switching from PhantomJS to Chrome.
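
A minimal sketch of the Selenium + PhantomJS combination described above, using the Selenium 3.x API (newer Selenium versions have dropped PhantomJS support, in which case headless Chrome or Firefox is the usual replacement); the URL is a placeholder:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.PhantomJS()          # headless WebKit browser
try:
    driver.get('https://example.com')   # load the page and execute its JavaScript
    # wait up to 10 seconds for the page title to be non-empty
    WebDriverWait(driver, 10).until(lambda d: d.title != '')
    html = driver.page_source           # rendered HTML, ready for a parser
    print(driver.title)
finally:
    driver.quit()                       # always shut the PhantomJS process down manually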

3. Why do requests made with the requests library need headers?

Reason: to impersonate a browser and trick the server into returning the same content it would return to a real browser. The headers take the form of a dictionary:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

Usage: requests.get(url, headers=headers)
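
A minimal runnable sketch of the above; the URL is a placeholder:

import requests

headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/54.0.2840.99 Safari/537.36")
}

resp = requests.get("https://example.com", headers=headers)
print(resp.status_code)   # 200 if the server responded normally
print(resp.text[:200])    # first part of the returned HTML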

4. What anti-crawler strategies have you encountered? And what are the coping strategies?

Anti-crawling based on headers

Anti-crawling based on user behavior: for example, the same IP visiting the same page many times within a short period, or the same account performing the same action many times within a short period

Anti-crawling on dynamic pages: for example, the data we need is obtained through Ajax requests or generated by JavaScript

Partial data encryption: for example, part of the data we want can be captured normally, while the other part is encrypted and comes back garbled

Coping strategies:

For basic page fetching, customize the headers, add header data, and use proxies.
Some website data can only be fetched completely after login, so a simulated login is required.
For sites that limit crawl frequency, set a lower crawl rate; for sites that block by IP, use multiple proxy IPs.
For dynamic pages, Selenium + PhantomJS can be used, but it is slow, so you can also look for the underlying data interface and request it directly.
For partially encrypted data, Selenium can take a screenshot of the page, and then Python's pytesseract library can be used to recognize the data from the image; the slower but most direct method, however, is to reverse-engineer the encryption.
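
A minimal sketch combining several of the strategies above: custom headers, a proxy pool, and a lowered request rate. The proxy addresses and URLs are placeholders:

import random
import time
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
proxy_pool = [
    {"http": "http://10.0.0.1:8080", "https": "http://10.0.0.1:8080"},  # placeholder proxies
    {"http": "http://10.0.0.2:8080", "https": "http://10.0.0.2:8080"},
]

urls = ["https://example.com/page/%d" % i for i in range(1, 4)]
for url in urls:
    # rotate proxies so no single IP hits the site too often
    resp = requests.get(url, headers=headers, proxies=random.choice(proxy_pool), timeout=10)
    print(url, resp.status_code)
    time.sleep(random.uniform(1, 3))   # lower the crawl frequency to avoid IP bans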

5. Distributed crawler principle?

Our core server is called master, and the machine used to run crawlers is called slave.

When we use the Scrapy framework to crawl a website, we need to give it some start_urls. The crawler first visits the URLs in start_urls and then, according to our specific logic, crawls the elements found there and follows the secondary and tertiary pages. To make this distributed, all we need to do is change how these start_urls are handled.

We set up a Redis database on the master (note that this database is only used for URL storage and is not to be confused with the MongoDB or MySQL used later), and create a separate list field for each type of website that needs to be crawled. We point the slaves' scrapy-redis URL setting at the master. As a result, even though there are multiple slaves, there is only one place to get URLs from: the Redis database on the master. Moreover, thanks to scrapy-redis's own queue mechanism, the slaves never receive conflicting links. After each slave finishes its fetching task, it sends the results to the server for aggregation (at this point the data store is no longer Redis, but a MongoDB or MySQL database holding the actual content). Another advantage of this approach is that the program is highly portable: as long as path issues are handled properly, porting a slave program to another machine is basically a copy-and-paste affair.
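
A minimal scrapy-redis sketch of the setup described above; the spider name, redis key, and master address are assumptions:

# settings.py on every slave: take URLs from the master's Redis and dedupe there too
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = "redis://master-host:6379"        # the master's Redis database

# spider module on every slave
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = "my_spider"
    redis_key = "my_spider:start_urls"        # the list field on the master holding the URLs

    def parse(self, response):
        # extract items here and hand them to a pipeline that writes to MongoDB/MySQL
        yield {"url": response.url, "title": response.css("title::text").get()}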

6. The differences between urllib and urllib2 in Python 2.x?

Similarities and differences: both perform operations on URL requests, but the differences are obvious. urllib2 can accept an instance of the Request class, which lets you set headers for a URL request, while urllib accepts only a URL. This means you cannot disguise your User-Agent string with the urllib module alone. urllib provides the urlencode method for generating GET query strings, while urllib2 does not, which is why urllib is often used together with urllib2. urllib2's comparative advantage is that urllib2.urlopen can accept a Request object as a parameter and thus control the headers of the HTTP request. However, urllib's urlretrieve function and its quote/unquote family (such as urllib.quote) were never added to urllib2, so urllib is still needed at times.
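
A Python 2 sketch illustrating the division of labour described above; the URLs are placeholders:

# Python 2 only: urllib2 sets headers via a Request object, urllib supplies urlencode/urlretrieve
import urllib
import urllib2

params = urllib.urlencode({"q": "python", "page": 1})          # urllib builds the query string
req = urllib2.Request("http://example.com/search?" + params,
                      headers={"User-Agent": "Mozilla/5.0"})   # urllib2 lets us set headers
html = urllib2.urlopen(req).read()

urllib.urlretrieve("http://example.com/logo.png", "logo.png")  # only urllib has urlretrieve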

7. What is the robots protocol?

The Robots protocol (also known as the crawler protocol, crawler rules, or robot protocol) is implemented through the robots.txt file. Through it, websites tell search engines which pages may be crawled and which may not.

The Robots protocol is a common code of ethics for websites on the Internet. Its purpose is to protect website data and sensitive information and to ensure that users' personal information and privacy are not violated. Because it is not an enforceable command, search engines need to comply with it voluntarily.
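
A short sketch of what a typical robots.txt looks like and how Python's standard robotparser can check it; the site and paths are hypothetical:

# A typical robots.txt served at https://example.com/robots.txt might contain:
#   User-agent: *
#   Disallow: /admin/
#   Allow: /
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()                                                  # fetch and parse the file
print(rp.can_fetch("*", "https://example.com/admin/"))     # False if /admin/ is disallowed
print(rp.can_fetch("*", "https://example.com/index.html")) # True if allowed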

8. What is a crawler?

A crawler is an automated program that requests websites and extracts data.

9. The basic process of crawlers?

1. Initiate a request: send a Request to the target site through an HTTP library; the request can carry additional headers, then wait for the server to respond.
2. Get the response content: if the server responds normally, a Response is returned whose content is the requested page.
3. Parse the content: with regular expressions, a page-parsing library, or JSON parsing.
4. Save the data: as text or to a database.
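
A minimal sketch of these four steps, assuming requests and BeautifulSoup are available; the URL and output file name are placeholders:

import json
import requests
from bs4 import BeautifulSoup

# 1. initiate a request
resp = requests.get("https://example.com", headers={"User-Agent": "Mozilla/5.0"})

# 2. get the response content
if resp.status_code == 200:
    html = resp.text

    # 3. parse the content (a parsing library here; regex or json.loads also work)
    soup = BeautifulSoup(html, "html.parser")
    data = {"title": soup.title.string if soup.title else None}

    # 4. save the data as text (or write it to a database)
    with open("result.json", "w") as f:
        json.dump(data, f)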

10. What are Request and Response?

The local machine sends a Request to the server, the server returns a Response, and the browser then displays the page.

1. The browser sends a message to the server where the URL is hosted; this process is called an HTTP Request.
2. After receiving the message sent by the browser, the server processes it according to its content and sends a message back to the browser; this process is called an HTTP Response.
3. After receiving the server's Response, the browser processes the message and displays the page.
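
A short sketch showing the Request/Response pair from the client side with requests; the URL is a placeholder:

import requests

resp = requests.get("https://example.com")       # the library builds and sends the Request
print(resp.request.method, resp.request.url)     # what was sent
print(resp.request.headers.get("User-Agent"))
print(resp.status_code)                          # what came back
print(resp.headers.get("Content-Type"))
print(len(resp.text))                            # the body a browser would render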