Distributed crawler framework services

A crawler service for github.com/zly-app/zapp

Instructions

crawler.WithService()       // register the crawler service
crawler.RegistrySpider(...) // inject a spider into the service

Example

package main

import (
   "fmt"
   "github.com/zly-app/zapp"
   "github.com/zly-app/crawler"
   "github.com/zly-app/crawler/core"
)

// A spider
type Spider struct {
   core.ISpiderTool // This interface must be embedded
}

// Initialization
func (s *Spider) Init() error { return nil }

// Submit the initial seed
func (s *Spider) SubmitInitialSeed() error {
   seed := s.NewSeed("https://www.baidu.com/", s.Parser) // Create the seed and specify the parsing method
   s.SubmitSeed(seed)                                    // Submit the seed
   return nil
}

// Parser method
func (s *Spider) Parser(seed *core.Seed) error {
   fmt.Println(string(seed.HttpResponseBody)) // Print the response body
   return nil
}

// Close
func (s *Spider) Close() error { return nil }

func main() {
   app := zapp.NewApp("a_spider", crawler.WithService()) // Enable the crawler service
   crawler.RegistrySpider(new(Spider))                   // Inject the spider
   app.Run()                                             // Run
}

Configuration

  • No configuration file is needed to run it
  • The default service type is `crawler`. For the complete configuration description, see Config

Configuration reference

# crawler configuration
[services.crawler.spider]
# the crawler name
Name = 'a_spider'
# When to submit the initial seed
SubmitInitialSeedOpportunity = 'start'
# Whether to automatically manage cookies
AutoCookie = false

# Frame configuration
[services.crawler.frame]
# Request timeout in milliseconds
RequestTimeout = 20000
# maximum number of attempts
RequestMaxAttemptCount = 5

Persistent queue

Use Redis as the queue

[services.crawler.queue]
type = 'redis'             # use redis as the queue; default is memory
Address = '127.0.0.1:6379' # address
UserName = ''              # username, optional
Password = ''              # password, optional
DB = 0                     # db, only valid for non-cluster, optional, default 0
IsCluster = false          # whether it is a cluster, optional, default false
MinIdleConns = 1           # minimum number of idle connections, optional, default 1
PoolSize = 1               # client pool size, optional, default 1
ReadTimeout = 5000         # read timeout in ms, optional, default 5000
WriteTimeout = 5000        # write timeout in ms, optional, default 5000
DialTimeout = 5000         # dial timeout in ms, optional, default 5000

Use SSDB as the queue

[services.crawler.queue]
type = 'ssdb'              # use ssdb as the queue; default is memory
Address = '127.0.0.1:8888' # address
Password = ''              # password, optional
MinIdleConns = 1           # minimum number of idle connections, optional, default 1
PoolSize = 1               # client pool size, optional, default 1
ReadTimeout = 5000         # read timeout in ms, optional, default 5000
WriteTimeout = 5000        # write timeout in ms, optional, default 5000
DialTimeout = 5000         # dial timeout in ms, optional, default 5000

Persistent set

The configuration is similar to that of the queue.

Use Redis as the set

[services.crawler.set]
type = 'redis'             # use redis as the set; default is memory
Address = '127.0.0.1:6379' # address
UserName = ''              # username, optional
Password = ''              # password, optional
DB = 0                     # db, only valid for non-cluster, optional, default 0
IsCluster = false          # whether it is a cluster, optional, default false
MinIdleConns = 1           # minimum number of idle connections, optional, default 1
PoolSize = 1               # client pool size, optional, default 1
ReadTimeout = 5000         # read timeout in ms, optional, default 5000
WriteTimeout = 5000        # write timeout in ms, optional, default 5000
DialTimeout = 5000         # dial timeout in ms, optional, default 5000

Use SSDB as the set

[services.crawler.set]
type = 'ssdb'              # use ssdb as the set; default is memory
Address = '127.0.0.1:8888' # address
Password = ''              # password, optional
MinIdleConns = 1           # minimum number of idle connections, optional, default 1
PoolSize = 1               # client pool size, optional, default 1
ReadTimeout = 5000         # read timeout in ms, optional, default 5000
WriteTimeout = 5000        # write timeout in ms, optional, default 5000
DialTimeout = 5000         # dial timeout in ms, optional, default 5000

Using a proxy

[services.crawler.proxy]
type = 'static'                      # static proxy, supports http, https, socks5, socks5h
address = 'socks5://127.0.0.1:1080'  # proxy address
User = ''                            # username, optional
Password = ''                        # password, optional

Concepts

Seed

  1. A seed describes the site from which you want to fetch data
  2. How the site is requested: the request method, form, and body, the cookies, whether to follow redirects automatically, and how to retry on failure
  3. What to do with the data once it is fetched
  4. When the data is fetched and how often it is fetched
  5. This is the seed: a seed describes one pass from the start of fetching to the handling of the data (a minimal sketch follows this list)
  6. A seed is stateless
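
A minimal sketch of the idea, reusing only the `Spider` type and the `NewSeed`/`SubmitSeed` calls from the example above (the URLs and the extra parser names are placeholders, not part of the framework): each seed binds a URL to the parser that will handle its response, so a fetch-and-process chain is simply a chain of seeds.

// Assumes the Spider type and imports from the example above.
// A seed created with s.NewSeed(listURL, s.ParseList) ends up here
// once its page has been downloaded.
func (s *Spider) ParseList(seed *core.Seed) error {
   // Chain a follow-up seed: fetch a (placeholder) detail page and hand it to ParseDetail.
   next := s.NewSeed("https://example.com/detail/1", s.ParseDetail)
   s.SubmitSeed(next)
   return nil
}

func (s *Spider) ParseDetail(seed *core.Seed) error {
   fmt.Println(len(seed.HttpResponseBody)) // process the downloaded detail page
   return nil
}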

Queue

  1. Seeds to be fetched are put into the queue; each round, one seed is taken from the front of the queue and the fetching process starts from it.
  2. If the fetched data is a list, such as a list of articles, the handler should iterate over it and submit a seed for each entry that contains the article information (see the sketch after this list)
  3. These seeds are placed at the front or the back of the queue, depending on the configuration, and then the next fetching round begins.
  4. The queue is what makes the framework distributed, parallel, and stateless.
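
For example (a hypothetical sketch, again reusing the `Spider` type and the `NewSeed`/`SubmitSeed` calls from the example above; how the article links are extracted from the page is up to you, so they are hard-coded here): a list-page parser iterates over the article links it finds and submits one seed per article, and those seeds go back into the queue for later rounds.

// Assumes the Spider type and imports from the example above.
func (s *Spider) ParseArticleList(seed *core.Seed) error {
   // Extract the article links from the list page however you prefer
   // (regexp, an HTML parser, ...); hard-coded placeholders for illustration.
   links := []string{
      "https://example.com/articles/1",
      "https://example.com/articles/2",
   }
   for _, link := range links {
      // One seed per article; each is queued and fetched in a later round.
      s.SubmitSeed(s.NewSeed(link, s.ParseArticle))
   }
   return nil
}

func (s *Spider) ParseArticle(seed *core.Seed) error {
   fmt.Println(string(seed.HttpResponseBody)) // handle the article page
   return nil
}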

Downloader

  1. Requests are made with the Go built-in http library (a conceptual sketch follows this list)
  2. Based on the seed's description, the downloader automatically builds the request: request body, request method, form, headers, cookies, and so on
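
Conceptually, and leaving out everything the framework adds on top (proxies, retries, cookie management, middleware), the downloader's job is roughly what this standalone net/http sketch does: build a request from a description and return the response body. This is only an illustration of the idea, not the framework's implementation; the requestSpec type is invented for the sketch.

package main

import (
   "fmt"
   "io"
   "net/http"
   "time"
)

// requestSpec is a conceptual stand-in for the part of a seed that describes the request.
type requestSpec struct {
   Method  string
   URL     string
   Headers map[string]string
}

// download builds an HTTP request from the spec, executes it with a timeout,
// and returns the response body, roughly what a downloader does for a seed.
func download(spec requestSpec, timeout time.Duration) ([]byte, error) {
   req, err := http.NewRequest(spec.Method, spec.URL, nil)
   if err != nil {
      return nil, err
   }
   for k, v := range spec.Headers {
      req.Header.Set(k, v)
   }
   client := &http.Client{Timeout: timeout}
   resp, err := client.Do(req)
   if err != nil {
      return nil, err
   }
   defer resp.Body.Close()
   return io.ReadAll(resp.Body)
}

func main() {
   body, err := download(requestSpec{Method: "GET", URL: "https://www.baidu.com/"}, 20*time.Second)
   if err != nil {
      fmt.Println("request failed:", err)
      return
   }
   fmt.Println("downloaded", len(body), "bytes")
}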

Middleware

  1. Middleware comes in two kinds: request middleware and response middleware
  2. Request middleware checks a seed before the downloader processes it and decides whether the request should be made
  3. Response middleware checks a seed after the downloader has processed it, verifying the validity of the data and deciding whether the seed should be handed to the handler
  4. Developers can write their own middleware (a hypothetical sketch of the idea follows this list)
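
The framework's real middleware interfaces are not spelled out in this document, so the following is a purely hypothetical sketch of the concept: request and response middleware can be thought of as two check hooks that the downloader calls around each request. None of these type or method names are the framework's API.

package main

import "fmt"

// sketchSeed is a cut-down stand-in for a seed, just for this sketch.
type sketchSeed struct {
   URL          string
   ResponseBody []byte
}

// Request middleware: inspect the seed before the downloader requests it
// and decide whether the request should go ahead.
type requestMiddleware interface {
   CheckRequest(s *sketchSeed) (ok bool, reason string)
}

// Response middleware: inspect the seed after download and decide whether
// it should be handed to the handler.
type responseMiddleware interface {
   CheckResponse(s *sketchSeed) (ok bool, reason string)
}

// Example response middleware: reject obviously empty responses.
type skipEmptyBody struct{}

func (skipEmptyBody) CheckResponse(s *sketchSeed) (bool, string) {
   if len(s.ResponseBody) == 0 {
      return false, "empty body"
   }
   return true, ""
}

func main() {
   var m responseMiddleware = skipEmptyBody{}
   ok, reason := m.CheckResponse(&sketchSeed{URL: "https://example.com"})
   fmt.Println(ok, reason) // false empty body
}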

Design ideas

Process independence

  1. In crawler's basic design, the smallest unit a crawler runs as is a process. A crawler may have multiple processes, and each process can run on any machine.
  2. Each process handles only one seed at a time and has its own DB connections, its own downloader, and so on; processes do not affect each other.
  3. You do not have to worry about how multiple processes coordinate: develop as if there were a single process, then start multiple processes at runtime.
  4. Multiple processes require a queue service such as redis or ssdb; enabling a multi-process crawler on the memory queue may produce unexpected results.

Request independence

  1. Each request is independent, and seeds are decoupled from processes: a process consumes the initial seed (the initial URL), generates more seeds according to the processing logic, and puts them into the queue; processes then take seeds from the queue, request them, and parse them.
  2. Seeds are repeatedly taken from the queue, processed, and saved, and new seeds are generated and queued, until no seeds are left in the queue.

Fetching process

  1. Before the request, the seed passes the request middleware checks.
  2. The downloader downloads the site data described by the seed and writes it into the seed.
  3. After the download, the seed passes the response middleware checks.
  4. The seed is then handed to the user, who decides how to extract the data.
  5. Note: if a stop signal is received while a seed is being fetched, the seed is put back into the queue so that it is not lost.

Configuration-driven

  1. By design, constants that may need to change should be extracted during development into configuration files, so they can be adjusted later.

Modularity

  1. Requests, the queue, the proxy, the downloader, configuration management, and so on are abstracted into separate modules, each with its own responsibility; this decoupling makes later upgrades and maintenance easier
  2. Users can also replace individual modules with their own logic according to their needs.

Process management

Under development…

Some operations

I have already fetched this data; how do I stop the crawler from fetching it again?

  1. Put the unique identifier (usually the URL) of each seed that has finished processing (whose data has been persisted) into the set
  2. Before submitting a new seed, check whether its unique identifier already exists in the set. That's it (a hypothetical sketch follows this list)
  3. Note: the same data may still be fetched again this way, because a seed with the same identifier can be submitted before the first seed has finished processing
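
Since this document does not spell out the set API, the sketch below uses an invented dedupSet interface to illustrate the flow only: check the set before submitting a seed, and add the identifier (typically the URL) only after the desired data has been persisted.

package main

import "fmt"

// dedupSet is a placeholder for whatever persistent set the configuration provides
// (redis, ssdb, ...); its methods are invented for this sketch.
type dedupSet interface {
   Has(key string) bool
   Add(key string)
}

// memorySet is an in-memory stand-in, good enough to illustrate the logic.
type memorySet map[string]struct{}

func (m memorySet) Has(key string) bool { _, ok := m[key]; return ok }
func (m memorySet) Add(key string)      { m[key] = struct{}{} }

// submitIfNew submits (here: just prints) a URL only if it has not been processed yet.
func submitIfNew(set dedupSet, url string) {
   if set.Has(url) {
      return // already processed, do not fetch it again
   }
   fmt.Println("submit seed for", url)
}

// markDone records the URL only after the desired data has been persisted.
func markDone(set dedupSet, url string) { set.Add(url) }

func main() {
   set := memorySet{}
   submitIfNew(set, "https://example.com/a") // submitted
   markDone(set, "https://example.com/a")    // data persisted, record it
   submitIfNew(set, "https://example.com/a") // skipped
}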

Seeds that make no request

  1. A seed does not have to make a request, but it must have a handler

Project management tool

  1. Install: `go install github.com/zly-app/crawler/tools/crawler@latest`
  2. Usage help: `crawler help`

Commands

  1. Initialize a project: `crawler init <project_name> && cd <project_name>`
  2. Create a crawler: `crawler create <spider>`
  3. Submit the initial seed: `crawler start <spider>`
  4. Clear all queues of a crawler: `crawler clean <spider>`
  5. Clear a crawler's set data: `crawler clean_set <spider>`
  6. Generate the supervisor configuration (the Supervisor official website is linked here)
    1. `crawler make` generates ini files in the `supervisor_config/conf.d` directory according to `configs/supervisor_programs.toml`
    2. Change the supervisor configuration to include the files `<project_dir>/supervisor_config/conf.d/*.ini`

Scheduler tools

  1. Install: `go build github.com/zly-app/crawler/tools/crawler_scheduler`
  2. Change the supervisor configuration to include the file `<project_dir>/supervisor_config/scheduler_config.ini`