In this section, we will integrate Scrapy-Redis to implement distributed crawling.

1. Preparation

Please make sure that you have successfully implemented the Scrapy Sina Weibo crawler and that the Scrapy-Redis library is installed correctly.

2. Build the Redis server

To deploy the crawler in a distributed way, multiple hosts need to share the crawl queue and the deduplication set, both of which are stored in a Redis database. Therefore, we need to set up a Redis server that is accessible from the public network.

A Linux server is recommended. Cloud hosts from providers such as Alibaba Cloud, Tencent Cloud, and Azure generally come with a public IP address. For the installation procedure, refer to the Redis installation instructions in Chapter 1.

After installation, Redis can be connected to remotely. Note that some providers (such as Alibaba Cloud and Tencent Cloud) require you to open the Redis port in the security group before the server can be accessed remotely. If the remote connection fails, check the security group settings.

Record the IP address, port, and password of the running Redis server; they are needed later when configuring the distributed crawler. The Redis server configured here has the IP address 120.27.34.25, the default port 6379, and the password foobared.
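
To verify the remote connection in advance, here is a minimal sketch using the redis-py library (an assumption on my part: redis-py is installed on the client machine; substitute your own address, port, and password):

import redis

# Connect to the remote Redis server; the address, port, and password
# below are the example values used in this section.
client = redis.StrictRedis(host='120.27.34.25', port=6379, password='foobared')

# ping() returns True if the connection and authentication succeed.
print(client.ping())

If this prints True, the server is reachable from outside and the password is correct.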

3. Deploy the proxy pool and Cookies pool

The Sina Weibo project needs to use the proxy pool and the Cookies pool, both of which we previously ran locally. Therefore, we need to run them on a server that is accessible from the public network: upload the code to the server, modify the Redis connection settings, and run the proxy pool and Cookies pool in the same way as before.

Then try accessing the interfaces provided by the proxy pool and the Cookies pool remotely to obtain a random proxy and random Cookies. If remote access fails, first make sure each service is bound to host 0.0.0.0, and then check the security group configuration.

For example, the proxy pool and Cookies pool I configured run on the server at IP address 120.27.34.25, on ports 5555 and 5556 respectively, as shown in the figure below.

Next, we modify the access URLs in the Scrapy Sina Weibo project as follows:

PROXY_URL = 'http://120.27.34.25:5555/random'
COOKIES_URL = 'http://120.27.34.25:5556/weibo/random'

Adjust these values according to your own IP addresses and port numbers.
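
For reference, the project can obtain a random proxy and random Cookies from these two interfaces with plain HTTP GET requests. Below is a minimal sketch using the requests library; the helper names are only illustrative, and it assumes each interface returns its result directly as the response body:

import requests

PROXY_URL = 'http://120.27.34.25:5555/random'
COOKIES_URL = 'http://120.27.34.25:5556/weibo/random'

def get_random_proxy():
    # Assumes the proxy pool returns a proxy such as "8.8.8.8:8888" as plain text
    response = requests.get(PROXY_URL)
    if response.status_code == 200:
        return response.text
    return None

def get_random_cookies():
    # Assumes the Cookies pool returns the Cookies of a random account as text
    response = requests.get(COOKIES_URL)
    if response.status_code == 200:
        return response.text
    return None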

4. Configure Scrapy-Redis

Configuring Scrapy-Redis is simple: we only need to modify the settings.py configuration file.

1. Core configuration

First and most importantly, replace the scheduler class and the deduplication class with the ones provided by Scrapy-Redis. Add the following configuration to settings.py:

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

2. Configure the Redis connection

Next, configure the Redis connection information. There are two ways to do this.

The first way is to use a connection string. We can construct a connection string from the Redis address, port, and password. The supported forms are as follows:

redis://[:password]@host:port/db
rediss://[:password]@host:port/db
unix://[:password]@/path/to/socket.sock?db=db

Here password is the password (preceded by a colon), the square brackets indicate that the item is optional, host is the Redis address, port is the running port, and db is the database index, whose default value is 0.

Using my Redis connection information mentioned above, construct the Redis connection string as follows:

redis://:[email protected]:6379

Set it to REDIS_URL in settings.py:

REDIS_URL = 'redis://:[email protected]:6379'

The second configuration method is to configure items separately. For example, based on my Redis connection information, I can configure the following code in settings.py:

REDIS_HOST = '120.27.34.25'
REDIS_PORT = 6379
REDIS_PASSWORD = 'foobared'

This code configures the Redis address, port, and password separately.

Note that if REDIS_URL is configured, Scrapy-Redis uses REDIS_URL first and ignores the three settings above. If you want to configure the items separately, do not configure REDIS_URL.

In this project, I chose to configure REDIS_URL.

3. Configure the scheduling queue

This configuration is optional and PriorityQueue is used by default. If you want to change the configuration, you can configure the SCHEDULER_QUEUE_CLASS variable as follows:

SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

Select one of the preceding three lines to switch the storage mode of the crawl queue.

This item is not configured in this project; the default is used.

4. Configure persistence

This configuration is optional and defaults to False. By default, Scrapy-Redis clears the crawl queue and the deduplication fingerprint set after the crawl finishes.

If you do not want the crawl queue and the deduplication fingerprint set to be emptied automatically, you can add the following configuration:

SCHEDULER_PERSIST = True

When SCHEDULER_PERSIST is set to True, the crawl queue and the deduplication fingerprint set are not automatically cleared after the crawl finishes. If it is not configured, the default is False, meaning they are cleared automatically.

It is worth noting that if the crawler is forcibly interrupted, the crawl queue and the deduplication fingerprint set are not automatically cleared either.

This item is not configured in this project; the default is used.

5. Configure re-crawling

This configuration is optional and defaults to False. If persistence is configured or the crawler is forcibly interrupted, the crawl queue and fingerprint set are not emptied, so when the crawler is restarted it resumes the previous crawl. If we want to crawl from scratch again, we can configure the re-crawl option:

SCHEDULER_FLUSH_ON_START = True

When SCHEDULER_FLUSH_ON_START is set to True, the crawl queue and fingerprint set are cleared every time the crawler starts. For a distributed crawl we must make sure they are cleared only once; otherwise, if every crawler node clears them at startup, the crawl queue built up so far is wiped out each time, which inevitably disrupts the distributed crawl.

Note that this configuration is convenient for single-node crawlers and is not commonly used for distributed crawlers.

This item is not configured in this project; the default is used.

6. Pipeline configuration

This configuration is optional; the Pipeline is not enabled by default. Scrapy-Redis implements an Item Pipeline that stores items in Redis, so if this Pipeline is enabled, items are saved to the Redis database. With large data volumes we generally do not do this: Redis is memory-based and its strength is fast processing, so using it for storage would be wasteful. The configuration is as follows:

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

This project does not configure this item, that is, the Pipeline is not enabled.

At this point, the Scrapy-Redis configuration is complete. Some of the optional items were not configured here, but they may be useful in other Scrapy projects, depending on the situation.

5. Configure the storage target

Previously, the Scrapy Sina Weibo crawler project stored its data in a MongoDB instance running locally, that is, it connected to localhost. However, when the crawler is distributed across multiple hosts, each crawler would connect to its own local MongoDB, so we would have to install MongoDB on every host. This has two drawbacks: first, setting up a MongoDB environment on every host is cumbersome; second, each host's crawl results would be scattered across the individual hosts, which makes unified management inconvenient.

Therefore, it is better to store the results in a single place, such as the same MongoDB database. We can either set up a MongoDB service on the server or purchase a hosted MongoDB storage service.

Here we use a MongoDB service set up on the server. Its IP address is still 120.27.34.25, the user name is admin, and the password is admin123.

Configure MONGO_URI as follows:

MONGO_URI = 'mongodb://admin:[email protected]:27017'
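
For reference, here is a minimal sketch of an Item Pipeline that reads MONGO_URI from settings.py and writes items to the shared MongoDB. The class name, the MONGO_DB setting, and the collection name are illustrative assumptions; the actual project already defines its own MongoDB pipeline:

import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings configured in settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB', 'weibo')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Every host writes to the same remote MongoDB, so the results are centralized
        self.db['weibo'].insert_one(dict(item))
        return item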

So far, we’ve successfully configured a Scrapy distributed crawler.

6. Run

Next, deploy the code to each host, each with its own Python environment.

Run the following command on each host to start the crawl:

scrapy crawl weibocn

After this command is run on each host, every host fetches requests from the shared crawl queue in the configured Redis database and shares the same deduplication fingerprint set. Meanwhile, each host uses its own bandwidth and processor, so the hosts do not interfere with one another and crawl efficiency scales up accordingly.

7. Results

After a period of time, we can use RedisDesktop to inspect the remote Redis database. There are two keys: one named weibocn:dupefilter, which stores the fingerprints, and the other named weibocn:requests, which is the crawl queue, as shown in the figure below.

As time goes by, the fingerprint set keeps growing, the crawl queue changes dynamically, and the crawled data is continuously stored in the MongoDB database.
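
If you prefer to inspect these keys programmatically rather than with RedisDesktop, a small sketch using redis-py is shown below. It assumes the default PriorityQueue, under which Scrapy-Redis stores the crawl queue as a sorted set and the fingerprints as a plain set:

import redis

client = redis.StrictRedis(host='120.27.34.25', port=6379, password='foobared')

# Number of requests currently waiting in the shared crawl queue (a sorted set)
print(client.zcard('weibocn:requests'))

# Number of fingerprints recorded in the deduplication set
print(client.scard('weibocn:dupefilter'))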

8. Code for this section

The code for this section is available at https://github.com/Python3WebSpider/Weibo/tree/distributed; note that it is on the distributed branch.

9. Epilogue

In this section, we successfully implemented a distributed crawler by integrating Scrapy-Redis, but deployment is still inconvenient in many respects. In addition, Redis memory becomes a concern when the crawl volume is particularly large. We will look at optimization schemes in the following articles.

This article first appeared on Cui Qingcai's personal blog Jingmi, as part of the Python 3 web crawler development tutorial.
