Programming is not easy for any novice, but Python is a blessing for anyone who wants to learn it: its syntax is so elegant that reading Python code feels like reading an article, and it is known as one of the most elegant languages.

When I was first learning Python, the scripts I wrote most often were crawler scripts: scripts that grab proxies and verify them locally, scripts that log in to a forum and post automatically, scripts that receive mail automatically, and scripts that do simple captcha recognition.

What these scripts have in common is that they are all web-related, and they always use the same handful of methods to fetch links, so I have accumulated a fair amount of crawling experience. I am summarizing it here so that later work does not have to repeat the same labor.

1. Basic web page fetching

The GET method

The POST method
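The article's original snippets used urllib2 (Python 2) and have been lost in this repost. As a minimal sketch, the two request methods in Python 3's urllib.request might look like this; the function names and example.com are placeholders, not from the original:

```python
from urllib import request, parse

def fetch_get(url, params=None):
    """GET: query parameters are appended to the URL itself."""
    if params:
        url = url + "?" + parse.urlencode(params)
    with request.urlopen(url) as resp:
        return resp.read()

def fetch_post(url, data):
    """POST: the form data travels in the request body, bytes-encoded."""
    body = parse.urlencode(data).encode("utf-8")
    with request.urlopen(url, data=body) as resp:
        return resp.read()

# Usage (any reachable site works; example.com is just a placeholder):
# html = fetch_get("http://example.com", {"q": "python"})
# html = fetch_post("http://example.com/login", {"user": "a", "pw": "b"})
```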

2. Use a proxy server

This is useful in some situations, for example when your IP address has been blocked or the number of requests allowed per IP is limited.
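A sketch of routing requests through a proxy with urllib.request's ProxyHandler; the proxy address 127.0.0.1:8087 is a made-up example:

```python
from urllib import request

# Hypothetical local proxy; replace with a real proxy address.
proxy = request.ProxyHandler({"http": "http://127.0.0.1:8087"})
opener = request.build_opener(proxy)
request.install_opener(opener)  # all later urlopen() calls use the proxy
# html = request.urlopen("http://example.com").read()
```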

3. Cookies

Yes, that's right: if you want to use a proxy and cookies at the same time, add proxy_support and change the opener as follows:
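The combined opener from the text, sketched in Python 3: both handlers are passed to build_opener, so cookies are remembered across requests while traffic goes through the (hypothetical) proxy:

```python
import http.cookiejar
from urllib import request

cj = http.cookiejar.CookieJar()                      # holds cookies between requests
proxy_support = request.ProxyHandler({"http": "http://127.0.0.1:8087"})
# One opener with both behaviors: proxy first, then cookie handling.
opener = request.build_opener(proxy_support, request.HTTPCookieProcessor(cj))
request.install_opener(opener)
```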

4. Pretend to be a browser

Some websites dislike being visited by crawlers, so they reject all requests that look like one. At this point we need to disguise ourselves as a browser, which can be done by modifying a header in the HTTP request:
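A minimal sketch of the header trick: set a browser-like User-Agent on the Request object (the UA string below is an arbitrary example):

```python
from urllib import request

req = request.Request(
    "http://example.com",  # placeholder URL
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
# The server now sees a browser-style User-Agent instead of Python's default.
# html = request.urlopen(req).read()
```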

5. Page parsing

The most powerful tool for page parsing is, of course, the regular expression. Regexes differ from user to user and from site to site, so there is no need to elaborate on them here.

Then there are the parsing libraries, such as lxml and BeautifulSoup.

My assessment of these two libraries:

BeautifulSoup is a pure-Python implementation and therefore inefficient, but its features are practical; for example, it can give you the source of an HTML node straight from a search result.

lxml is implemented in C, is efficient, and supports XPath.
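Since lxml and BeautifulSoup are third-party packages, here is a self-contained toy of the regex approach on a made-up HTML fragment; with lxml you would instead build a tree and query it with an XPath such as //a/@href:

```python
import re

html = '<a href="/a.html">A</a> <a href="/b.html">B</a>'
# A regex is fine for a quick link grab on simple pages;
# for messy real-world HTML, prefer a proper parser.
links = re.findall(r'<a\s+href="([^"]+)"', html)
print(links)  # → ['/a.html', '/b.html']
```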

6. Verification code processing

What should I do if I encounter a captcha?

There are two cases:

A Google-style captcha: there is nothing to be done.

A simple captcha: a code with a limited character set that uses only simple translation or rotation plus noise, without distortion. This can be handled. The general idea is to rotate the image back, remove the noise, segment the individual characters, reduce the dimensionality with a feature-extraction method (such as PCA), and generate a feature database.

Each captcha is then compared against the feature database.

This is a little more complicated, so I will not expand on it here; please consult the relevant textbooks for the specifics.
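The matching step above can be illustrated with a toy nearest-neighbour classifier: each segmented character is flattened into a pixel vector and compared against a tiny hand-made feature database. The glyph vectors below are invented for illustration, and the denoising, deskewing, and PCA steps are deliberately omitted:

```python
# Hypothetical 3x3 binary glyphs, flattened row by row.
FEATURE_DB = {
    "1": (0, 1, 0,  0, 1, 0,  0, 1, 0),
    "7": (1, 1, 1,  0, 0, 1,  0, 1, 0),
}

def distance(a, b):
    """Squared Euclidean distance between two pixel vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(vec):
    """Return the database character whose vector is nearest to vec."""
    return min(FEATURE_DB, key=lambda ch: distance(FEATURE_DB[ch], vec))

noisy_one = (0, 1, 0,  0, 1, 0,  1, 1, 0)  # a "1" with one noisy pixel
print(classify(noisy_one))  # → 1
```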

7. Gzip/deflate support

Web pages now generally support gzip compression, which can often cut transfer time dramatically. Take the VeryCD home page as an example: the uncompressed version is 247 KB and the compressed version is 45 KB, about 1/5 of the original.

That means fetching it is roughly five times faster.

However, Python's urllib/urllib2 do not support compression by default. To get a compressed response you must specify 'accept-encoding' in the request header, and after reading the response you have to check its headers for a 'content-encoding' field to decide whether decoding is needed, which is tedious and trivial.

How can urllib2 be made to support gzip and deflate automatically? You can extend the BaseHandler class and then use build_opener to handle it:
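A partial Python 3 sketch of that idea: the handler advertises gzip/deflate support on every request, and a helper undoes the encoding after reading. (A fully transparent version would also override http_response to wrap the response object; here the decoding step is left explicit. ContentEncodingHandler and decode_body are made-up names.)

```python
import gzip
import zlib
from urllib import request

def decode_body(data, encoding):
    """Undo a Content-Encoding on a raw response body."""
    if encoding == "gzip":
        return gzip.decompress(data)
    if encoding == "deflate":
        try:
            return zlib.decompress(data)
        except zlib.error:                      # raw deflate, no zlib header
            return zlib.decompress(data, -zlib.MAX_WBITS)
    return data                                 # not compressed

class ContentEncodingHandler(request.BaseHandler):
    """Ask every server for gzip/deflate-compressed responses."""
    def http_request(self, req):
        req.add_header("Accept-Encoding", "gzip, deflate")
        return req
    https_request = http_request

opener = request.build_opener(ContentEncodingHandler)
# resp = opener.open("http://example.com")
# body = decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```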

8. Multi-threaded concurrent fetching

If a single thread is too slow, you need multiple threads. Here is a simple thread-pool template; the program merely prints 1-10, but you can see that it does so concurrently.
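The original template is missing from this repost; a minimal stdlib reconstruction of "a thread pool that prints 1-10" using queue and threading might look like this (the worker-count of 3 and the None poison pill are my choices, not the original's):

```python
import queue
import threading

task_q = queue.Queue()
done = []                            # collected results, for inspection
lock = threading.Lock()

def worker():
    while True:
        n = task_q.get()
        if n is None:                # poison pill: time to exit
            task_q.task_done()
            break
        print(n)                     # the "work"; order varies between runs
        with lock:
            done.append(n)
        task_q.task_done()

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for n in range(1, 11):               # enqueue the jobs 1..10
    task_q.put(n)
for _ in threads:
    task_q.put(None)                 # one pill per worker
task_q.join()
for t in threads:
    t.join()
```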

Even though Python's multithreading has a poor reputation, for a crawler that spends most of its time waiting on the network it can still improve efficiency to a certain extent.

9. Conclusion

Reading code written in Python feels like reading English, which lets users focus on solving problems rather than on understanding the language itself.

Python is implemented in C, but it does away with C's complicated pointers, making it simple and easy to learn.

And being open source, Python allows code to be read, copied, and even improved upon.

These features are what make Python so efficient ("Life is short, I use Python") and such a wonderful, powerful language.

To sum up, there are four things to remember when you start learning Python:

1. Follow code conventions. This is a very good habit in itself, and it can be painful later if you do not plan your code well at the start.

2. Do more, read less. Many people learn Python by blindly reading books, but this is not mathematics or physics, where working through the examples may be enough; learning Python is mainly about learning programming ideas.

3. Practice frequently. After learning a new concept, remember to apply it, or you will forget it; our line of work is mainly hands-on.

4. Be productive. If you feel inefficient, stop, find out why, and ask someone who has been there.

This article is a reprint.

Author: Abuse the heart

Original link: zhuanlan.zhihu.com/p/337082605