Distributed crawler framework services

A crawler service for github.com/zly-app/zapp

Instructions

crawler.WithService()       // register the crawler service
crawler.RegistrySpider(...) // inject a spider into the service

Example

package main

import (
   "fmt"
   "github.com/zly-app/zapp"
   "github.com/zly-app/crawler"
   "github.com/zly-app/crawler/core"
)

// A spider
type Spider struct {
   core.ISpiderTool // This interface must be embedded
}

// Initialization
func (s *Spider) Init() error { return nil }

// Submit the initial seed
func (s *Spider) SubmitInitialSeed() error {
   seed := s.NewSeed("https://www.baidu.com/", s.Parser) // Create the seed and specify the parsing method
   s.SubmitSeed(seed)                                    // Submit the seed
   return nil
}

// Parser method
func (s *Spider) Parser(seed *core.Seed) error {
   fmt.Println(string(seed.HttpResponseBody)) // Print the response body
   return nil
}

// Close
func (s *Spider) Close() error { return nil }

func main() {
   app := zapp.NewApp("a_spider", crawler.WithService()) // Enable the crawler service
   crawler.RegistrySpider(new(Spider))                   // Inject the spider
   app.Run()                                             // Run
}

Configuration

  • No configuration file is needed to run it
  • The default service type is `crawler`. For the complete configuration description, see Config

Configuration reference

# crawler configuration
[services.crawler.spider]
# the crawler name
Name = 'a_spider'
# When to submit the initial seed
SubmitInitialSeedOpportunity = 'start'
# Whether to automatically manage cookies
AutoCookie = false

# Frame configuration
[services.crawler.frame]
# Request timeout in milliseconds
RequestTimeout = 20000
# maximum number of attempts
RequestMaxAttemptCount = 5

Persistent queue

Use Redis as the queue

[services.crawler.queue]
type = 'redis'             # use redis as the queue; default is memory
Address = '127.0.0.1:6379' # address
UserName = ''              # username, optional
Password = ''              # password, optional
DB = 0                     # db, only valid for non-cluster, optional, default 0
IsCluster = false          # whether it is a cluster, optional, default false
MinIdleConns = 1           # minimum number of idle connections, optional, default 1
PoolSize = 1               # client pool size, optional, default 1
ReadTimeout = 5000         # read timeout in ms, optional, default 5000
WriteTimeout = 5000        # write timeout in ms, optional, default 5000
DialTimeout = 5000         # dial timeout in ms, optional, default 5000

Use SSDB as the queue

[services.crawler.queue]
type = 'ssdb'              # use ssdb as the queue; default is memory
Address = '127.0.0.1:8888' # address
Password = ''              # password, optional
MinIdleConns = 1           # minimum number of idle connections, optional, default 1
PoolSize = 1               # client pool size, optional, default 1
ReadTimeout = 5000         # read timeout in ms, optional, default 5000
WriteTimeout = 5000        # write timeout in ms, optional, default 5000
DialTimeout = 5000         # dial timeout in ms, optional, default 5000

Persistent set

The configuration is similar to that of the queue.

Use Redis as the set

[services.crawler.set]
type = 'redis'             # use redis as the set; default is memory
Address = '127.0.0.1:6379' # address
UserName = ''              # username, optional
Password = ''              # password, optional
DB = 0                     # db, only valid for non-cluster, optional, default 0
IsCluster = false          # whether it is a cluster, optional, default false
MinIdleConns = 1           # minimum number of idle connections, optional, default 1
PoolSize = 1               # client pool size, optional, default 1
ReadTimeout = 5000         # read timeout in ms, optional, default 5000
WriteTimeout = 5000        # write timeout in ms, optional, default 5000
DialTimeout = 5000         # dial timeout in ms, optional, default 5000

Use SSDB as the set

[services.crawler.set]
type = 'ssdb'              # use ssdb as the set; default is memory
Address = '127.0.0.1:8888' # address
Password = ''              # password, optional
MinIdleConns = 1           # minimum number of idle connections, optional, default 1
PoolSize = 1               # client pool size, optional, default 1
ReadTimeout = 5000         # read timeout in ms, optional, default 5000
WriteTimeout = 5000        # write timeout in ms, optional, default 5000
DialTimeout = 5000         # dial timeout in ms, optional, default 5000

Using a proxy

[services.crawler.proxy]
type = 'static'                      # static proxy, supports http, https, socks5, socks5h
address = 'socks5://127.0.0.1:1080'  # proxy address
User = ''                            # username, optional
Password = ''                        # password, optional

Concepts

Seed

  1. A seed describes the site from which you want to fetch data
  2. How the site is requested: the request method, form, and body, the cookies, whether to follow redirects automatically, and how to retry on failure
  3. What to do with the data once it is fetched
  4. When the data is fetched and how often it is fetched
  5. This is the seed: a seed describes one pass from the start of fetching to the handling of the data (a minimal sketch follows this list)
  6. A seed is stateless
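
A minimal sketch of the idea, reusing only the `Spider` type and the `NewSeed`/`SubmitSeed` calls from the example above (the URLs and the extra parser names are placeholders, not part of the framework): each seed binds a URL to the parser that will handle its response, so a fetch-and-process chain is simply a chain of seeds.

// Assumes the Spider type and imports from the example above.
// A seed created with s.NewSeed(listURL, s.ParseList) ends up here
// once its page has been downloaded.
func (s *Spider) ParseList(seed *core.Seed) error {
   // Chain a follow-up seed: fetch a (placeholder) detail page and hand it to ParseDetail.
   next := s.NewSeed("https://example.com/detail/1", s.ParseDetail)
   s.SubmitSeed(next)
   return nil
}

func (s *Spider) ParseDetail(seed *core.Seed) error {
   fmt.Println(len(seed.HttpResponseBody)) // process the downloaded detail page
   return nil
}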

Queue

  1. Seeds to be fetched are put into the queue; each round, one seed is taken from the front of the queue and the fetching process starts from it.
  2. If the fetched data is a list, such as a list of articles, the handler should iterate over it and submit a seed for each entry that contains the article information (see the sketch after this list)
  3. These seeds are placed at the front or the back of the queue, depending on the configuration, and then the next fetching round begins.
  4. The queue is what makes the framework distributed, parallel, and stateless.
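
For example (a hypothetical sketch, again reusing the `Spider` type and the `NewSeed`/`SubmitSeed` calls from the example above; how the article links are extracted from the page is up to you, so they are hard-coded here): a list-page parser iterates over the article links it finds and submits one seed per article, and those seeds go back into the queue for later rounds.

// Assumes the Spider type and imports from the example above.
func (s *Spider) ParseArticleList(seed *core.Seed) error {
   // Extract the article links from the list page however you prefer
   // (regexp, an HTML parser, ...); hard-coded placeholders for illustration.
   links := []string{
      "https://example.com/articles/1",
      "https://example.com/articles/2",
   }
   for _, link := range links {
      // One seed per article; each is queued and fetched in a later round.
      s.SubmitSeed(s.NewSeed(link, s.ParseArticle))
   }
   return nil
}

func (s *Spider) ParseArticle(seed *core.Seed) error {
   fmt.Println(string(seed.HttpResponseBody)) // handle the article page
   return nil
}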

Downloader

  1. Requests are made with the Go built-in http library (a conceptual sketch follows this list)
  2. Based on the seed's description, the downloader automatically builds the request: request body, request method, form, headers, cookies, and so on
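
Conceptually, and leaving out everything the framework adds on top (proxies, retries, cookie management, middleware), the downloader's job is roughly what this standalone net/http sketch does: build a request from a description and return the response body. This is only an illustration of the idea, not the framework's implementation; the requestSpec type is invented for the sketch.

package main

import (
   "fmt"
   "io"
   "net/http"
   "time"
)

// requestSpec is a conceptual stand-in for the part of a seed that describes the request.
type requestSpec struct {
   Method  string
   URL     string
   Headers map[string]string
}

// download builds an HTTP request from the spec, executes it with a timeout,
// and returns the response body, roughly what a downloader does for a seed.
func download(spec requestSpec, timeout time.Duration) ([]byte, error) {
   req, err := http.NewRequest(spec.Method, spec.URL, nil)
   if err != nil {
      return nil, err
   }
   for k, v := range spec.Headers {
      req.Header.Set(k, v)
   }
   client := &http.Client{Timeout: timeout}
   resp, err := client.Do(req)
   if err != nil {
      return nil, err
   }
   defer resp.Body.Close()
   return io.ReadAll(resp.Body)
}

func main() {
   body, err := download(requestSpec{Method: "GET", URL: "https://www.baidu.com/"}, 20*time.Second)
   if err != nil {
      fmt.Println("request failed:", err)
      return
   }
   fmt.Println("downloaded", len(body), "bytes")
}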

Middleware

  1. Middleware comes in two kinds: request middleware and response middleware
  2. Request middleware checks a seed before the downloader processes it and decides whether the request should be made
  3. Response middleware checks a seed after the downloader has processed it, verifying the validity of the data and deciding whether the seed should be handed to the handler
  4. Developers can write their own middleware (a hypothetical sketch of the idea follows this list)
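
The framework's real middleware interfaces are not spelled out in this document, so the following is a purely hypothetical sketch of the concept: request and response middleware can be thought of as two check hooks that the downloader calls around each request. None of these type or method names are the framework's API.

package main

import "fmt"

// sketchSeed is a cut-down stand-in for a seed, just for this sketch.
type sketchSeed struct {
   URL          string
   ResponseBody []byte
}

// Request middleware: inspect the seed before the downloader requests it
// and decide whether the request should go ahead.
type requestMiddleware interface {
   CheckRequest(s *sketchSeed) (ok bool, reason string)
}

// Response middleware: inspect the seed after download and decide whether
// it should be handed to the handler.
type responseMiddleware interface {
   CheckResponse(s *sketchSeed) (ok bool, reason string)
}

// Example response middleware: reject obviously empty responses.
type skipEmptyBody struct{}

func (skipEmptyBody) CheckResponse(s *sketchSeed) (bool, string) {
   if len(s.ResponseBody) == 0 {
      return false, "empty body"
   }
   return true, ""
}

func main() {
   var m responseMiddleware = skipEmptyBody{}
   ok, reason := m.CheckResponse(&sketchSeed{URL: "https://example.com"})
   fmt.Println(ok, reason) // false empty body
}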

Design ideas

Process independence

  1. In crawler's basic design, the smallest unit a crawler runs as is a process. A crawler may have multiple processes, and each process can run on any machine.
  2. Each process handles only one seed at a time and has its own DB connections, its own downloader, and so on; processes do not affect each other.
  3. You do not have to worry about how multiple processes coordinate: develop as if there were a single process, then start multiple processes at runtime.
  4. Multiple processes require a queue service such as redis or ssdb; enabling a multi-process crawler on the memory queue may produce unexpected results.

Request independence

  1. Each request is independent, and seeds are decoupled from processes: a process consumes the initial seed (the initial URL), generates more seeds according to the processing logic, and puts them into the queue; processes then take seeds from the queue, request them, and parse them.
  2. Seeds are repeatedly taken from the queue, processed, and saved, and new seeds are generated and queued, until no seeds are left in the queue.

Fetching process

  1. Before the request, the seed passes the request middleware checks.
  2. The downloader downloads the site data described by the seed and writes it into the seed.
  3. After the download, the seed passes the response middleware checks.
  4. The seed is then handed to the user, who decides how to extract the data.
  5. Note: if a stop signal is received while a seed is being fetched, the seed is put back into the queue so that it is not lost.

Configuration-driven

  1. By design, constants that may need to change should be extracted during development into configuration files, so they can be adjusted later.

Modularity

  1. Requests, the queue, the proxy, the downloader, configuration management, and so on are abstracted into separate modules, each with its own responsibility; this decoupling makes later upgrades and maintenance easier
  2. Users can also replace individual modules with their own logic according to their needs.

Process management

Under development…

Some operations

I have already fetched this data; how do I stop the crawler from fetching it again?

  1. Put the unique identifier (usually the URL) of each seed that has finished processing (whose data has been persisted) into the set
  2. Before submitting a new seed, check whether its unique identifier already exists in the set. That's it (a hypothetical sketch follows this list)
  3. Note: the same data may still be fetched again this way, because a seed with the same identifier can be submitted before the first seed has finished processing
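
Since this document does not spell out the set API, the sketch below uses an invented dedupSet interface to illustrate the flow only: check the set before submitting a seed, and add the identifier (typically the URL) only after the desired data has been persisted.

package main

import "fmt"

// dedupSet is a placeholder for whatever persistent set the configuration provides
// (redis, ssdb, ...); its methods are invented for this sketch.
type dedupSet interface {
   Has(key string) bool
   Add(key string)
}

// memorySet is an in-memory stand-in, good enough to illustrate the logic.
type memorySet map[string]struct{}

func (m memorySet) Has(key string) bool { _, ok := m[key]; return ok }
func (m memorySet) Add(key string)      { m[key] = struct{}{} }

// submitIfNew submits (here: just prints) a URL only if it has not been processed yet.
func submitIfNew(set dedupSet, url string) {
   if set.Has(url) {
      return // already processed, do not fetch it again
   }
   fmt.Println("submit seed for", url)
}

// markDone records the URL only after the desired data has been persisted.
func markDone(set dedupSet, url string) { set.Add(url) }

func main() {
   set := memorySet{}
   submitIfNew(set, "https://example.com/a") // submitted
   markDone(set, "https://example.com/a")    // data persisted, record it
   submitIfNew(set, "https://example.com/a") // skipped
}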

Seeds that make no request

  1. A seed does not have to make a request, but it must have a handler

Project management tool

  1. Install: `go install github.com/zly-app/crawler/tools/crawler@latest`
  2. Usage help: `crawler help`

Commands

  1. Initialize a project: `crawler init <project_name> && cd <project_name>`
  2. Create a crawler: `crawler create <spider>`
  3. Submit the initial seed: `crawler start <spider>`
  4. Clear all queues of a crawler: `crawler clean <spider>`
  5. Clear a crawler's set data: `crawler clean_set <spider>`
  6. Generate the supervisor configuration (the Supervisor official website is linked here)
    1. `crawler make` generates ini files in the `supervisor_config/conf.d` directory according to `configs/supervisor_programs.toml`
    2. Change the supervisor configuration to include the files `<project_dir>/supervisor_config/conf.d/*.ini`

Scheduler tools

  1. Install: `go build github.com/zly-app/crawler/tools/crawler_scheduler`
  2. Change the supervisor configuration to include the file `<project_dir>/supervisor_config/scheduler_config.ini`