Day 1 | 12 Days to get a Python web crawler

Beautiful small MM of human resources department, run to ask me: Old Chen, data analysis and crawler is the relationship after all? To be honest, I really don’t want to talk to her, because I always think it has little to do with her work, but when I think of her in charge of the recruitment work of my department, I have to reluctantly tell her: data analysis, eat inside, crawler, climb outside, together is eat inside climb outside.

Era of big data, to perform data analysis, the first thing to have a data source, the company that several drizzle alone (data), analysis of loneliness is not enough, only through learning the crawler, from the outside (site) take up some relevant and useful data, in order to let the boss of business decisions according to the can depend on, and you, also is the boss.

A mention of the boss, beautiful small MM, excited to get up, immediately loudly ask: you IT bound, the most handsome is that make the li boss of search engine?

Although I am a little unconvinced, a little unhappy, but HOW can I get, after all, in the network crawler, he (Boss Li) technology than really strong. He knows how to use crawler technology, every day in the mass of Internet information to crawl, crawl high-quality information and included in the database he set. When the user enters the keyword in the search engine, the engine system will carry on the data analysis processing to the keyword, find out the relevant webpage from the included webpage, sort according to certain ranking rules and show the result to the user.

At the thought of ranking to earn money, Li boss a point all don’t give me, I told manpower MM: good, don’t pull shit with you, I want to tell my old iron said the principle of the network crawler, you eat inside and crawl outside the guy, see your boss go.

1. What is a reptile

Web crawler is also known as web spider, web ant, web machine, etc. It crawls data on the network according to the rules formulated by us. The result will be HTML code, JSON data, images, audio, or video. According to the actual requirements, the programmer filters the data, extracts the useful ones and stores them.

To put it simply, it is to use Python programming language to simulate a browser, visit a specified website, return results to it, filter and extract the data you need according to the rules, and store it for use.

Read my “the 10th day | 12 days Python, file operations” and “the 11th day | 12 days Python, database operations,” brother, should know that data often exists file or database.

2. Crawl process

Users access network data through the browser: open the browser -> enter the URL -> submit the browser request -> download web code -> parse into a page.

Crawler programming, specify the url, simulate the browser to send a request (get web code)-> extract useful data -> store in a file or database.

! [](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/b3d0e55870e648da8c6146a45da61ac6~tplv-k3u1fbpfcp-watermark.image)

Python is recommended for crawler programming because the Python crawler library is simple and easy to use, which can meet most functions in the Python built-in environment. It can:

(1) use HTTP library to initiate a Request to the target site, that is, send a Request(including Request header and Request body, etc.);

(2) The built-in library (HTML, JSON, regular expression) is used to parse the Response returned by the server

(3) Store the required data in files or databases.

If Python’s built-in libraries are not enough, you can use the PIP Install library name to quickly download a third library and use it.

3. Location of climbing points

In the process of writing crawler code, it is often necessary to specify the node or path to climb. If I told you that you could get nodes or paths quickly using Chrome, would you immediately check to see if your computer was installed?

Yes, that’s right. No, go ahead and install it.

On the page, press the keyboard key F2 to display the source code. Select the node you want to get, right-click [check] to locate the code, right-click the code, select [Copy] – [Copy Selector] or [Copy XPath] to Copy the content of the node or path.

Well, the content of the principle of crawler, Chen finished, if you feel helpful, I hope old iron can forward points like, so that more people see this article. Your retweets and likes are the biggest encouragement for Lao Chen to continue to create and share.

An old guy who’s been cTO for 10 years, sharing years of programming experience. Want to learn programming friends, can pay attention to today’s headlines: Chen said programming. I’ll share some good stuff about Python, front-end (applets), and apps.

Follow me, that’s right.

Day 1 | 12 Days to get a Python web crawler

1. What is a reptile

2. Crawl process

3. Location of climbing points

Related Posts

Top 10 Python Open Source projects

Have you settled your tech debt? Try to reconstruct

Modular, structured code, programmers are telling the story of life