Golang Distributed Crawler System: V1.0 Architecture
1. What is a distributed system
A distributed system is one whose hardware or software components are spread across networked computers and communicate and coordinate only by passing messages. Put simply, it is a collection of independent computers that together provide a service, yet to its users it appears to be a single machine. Being distributed means the service can run on clusters of many ordinary computers (as opposed to expensive mainframes): the more machines there are, the more CPU, memory, and storage resources are available, and the more concurrent requests the system can handle.
Since, by this definition, the hosts communicate and coordinate over the network, there is almost no spatial constraint on where the computers sit: they may be in different racks, in different server rooms, or in different cities; for a large site they may even be spread across different countries and regions.
2. Characteristics of distributed systems
Different sources describe the characteristics of distributed systems in slightly different but broadly similar terms. For the distributed crawler to be implemented here, we summarize the following characteristics:
- Multiple nodes
  - Fault tolerance
  - Scalability (performance)
  - Natural distribution
- Message passing
  - Nodes have private storage
  - Easy development
  - Scalability (features)
  - Contrast: parallel computing, which is built to fulfill one specific requirement
  - Methods of message passing:
    - REST
    - RPC
    - Middleware
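Of the message-passing methods above, RPC maps naturally onto Go's standard library. The following is a minimal sketch, not the project's actual protocol: `WorkerNode`, `CrawlRequest`, and `CrawlResult` are illustrative names, and the "work" is a placeholder, but the wiring (`net/rpc` server on one node, client call from another) is the real mechanism.

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
)

// CrawlRequest and CrawlResult are hypothetical message types between nodes.
type CrawlRequest struct{ URL string }
type CrawlResult struct{ Items int }

// WorkerNode is a hypothetical RPC service a crawler node might expose.
type WorkerNode struct{}

// Crawl is the RPC method; a real node would fetch and parse req.URL here.
func (w *WorkerNode) Crawl(req CrawlRequest, res *CrawlResult) error {
	res.Items = len(req.URL) // placeholder "work"
	return nil
}

// runDemo starts a worker node and sends it one message, as a second node would.
func runDemo() (int, error) {
	srv := rpc.NewServer()
	if err := srv.Register(new(WorkerNode)); err != nil {
		return 0, err
	}
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return
			}
			go srv.ServeConn(conn)
		}
	}()

	client, err := rpc.Dial("tcp", ln.Addr().String())
	if err != nil {
		return 0, err
	}
	defer client.Close()
	var res CrawlResult
	err = client.Call("WorkerNode.Crawl", CrawlRequest{URL: "https://example.com"}, &res)
	return res.Items, err
}

func main() {
	n, err := runDemo()
	if err != nil {
		panic(err)
	}
	fmt.Println("items:", n)
}
```

REST would replace the `rpc.Dial`/`Call` pair with HTTP requests and JSON bodies; middleware (a message queue) would decouple the two nodes entirely.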
3. Requirements and key design points
In crawler development, some business scenarios require crawling hundreds or even thousands of sites at the same time, so a framework supporting multiple crawlers is needed. The design should pay attention to the following points:
Code reuse and modular functionality. Suppose you wrote a complete, standalone crawler for every site: that would mean a great deal of duplicated work, development would be inefficient, and the whole project would become bloated and hard to manage later on.
Easy extension. The most intuitive requirement of a multi-crawler framework is that it be easy to extend: adding a new target site should require writing only the site-specific parts (crawl rules, parsing rules, storage rules), nothing more.
Robustness and maintainability. With so many sites crawled at once, the probability of errors is higher: dropped connections, being blocked from crawling, scraping "dirty data", and so on. Good log monitoring is therefore essential, so that the crawler system's state can be watched in real time and errors located precisely. All exceptions must be handled as well: if you come back from vacation to find the crawler died days ago over a minor problem, those days of crawling are wasted (though personally I still check the crawler's status remotely from time to time).
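One concrete way to "take care of all the exceptions" in Go is to wrap every crawl task so a panic in parsing or networking is logged and converted into an error rather than killing the process. This is a minimal sketch; `safeCrawl` and its signature are illustrative, not from any particular framework.

```go
package main

import (
	"fmt"
	"log"
)

// safeCrawl runs one crawl task and converts any panic (e.g. from a buggy
// parser on an unexpected page) into a logged error, so one bad page
// cannot bring down the whole crawler.
func safeCrawl(url string, crawl func(string) error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("crawl %s panicked: %v", url, r)
			log.Println(err) // real code would feed this into log monitoring
		}
	}()
	return crawl(url)
}

func main() {
	err := safeCrawl("https://example.com/bad", func(string) error {
		panic("unexpected page structure") // simulate a parser bug
	})
	fmt.Println("recovered:", err != nil)
}
```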
Distribution. Multi-site crawling generally produces large volumes of data and must be able to scale out, so distribution is a required feature. Care must be taken to queue messages properly and to deduplicate consistently across all nodes.
Crawler optimization. This is a big topic, but a central one: the framework should be built on asynchronous I/O, or on coroutines combined with multiple processes.
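In Go, the coroutine-based approach above maps onto goroutines and channels. Below is a minimal worker-pool sketch (names like `fetchAll` are illustrative): a fixed number of goroutines consume URLs from a job channel, so concurrency is bounded regardless of how many URLs are queued.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchAll crawls the given URLs with at most `workers` concurrent fetches.
// `fetch` stands in for a real HTTP fetch + parse step.
func fetchAll(urls []string, workers int, fetch func(string) string) []string {
	jobs := make(chan string)
	results := make(chan string)

	// Worker pool: each goroutine pulls URLs until the job channel closes.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- fetch(u)
			}
		}()
	}

	// Feed jobs, then close so workers can exit.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
	}()

	// Close results once every worker is done.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	urls := []string{"a", "b", "c"}
	got := fetchAll(urls, 2, func(u string) string { return "fetched:" + u })
	fmt.Println(len(got), "pages fetched")
}
```

Note that results arrive in completion order, not submission order; a real crawler would carry the URL along with the result.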
4. Project architecture analysis
4.1 Deduplication problems
Consider writing a Bloom filter, so that changing requirements can be met more quickly.
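A Bloom filter trades a small false-positive rate for a very compact memory footprint, which suits crawl deduplication: a "seen" answer may occasionally skip a fresh URL, but a "not seen" answer is always correct. Here is a minimal self-contained sketch (the sizing and the double-hashing scheme over two FNV halves are illustrative choices, not from the project):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a minimal Bloom filter: k hash probes into a bit set.
type bloom struct {
	bits []uint64
	m    uint32 // number of bits
	k    uint32 // number of probes per element
}

func newBloom(m, k uint32) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// probes derives k bit positions from one 64-bit FNV hash via double hashing.
func (b *bloom) probes(s string) []uint32 {
	h := fnv.New64a()
	h.Write([]byte(s))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)
	out := make([]uint32, b.k)
	for i := uint32(0); i < b.k; i++ {
		out[i] = (h1 + i*h2) % b.m
	}
	return out
}

func (b *bloom) Add(s string) {
	for _, p := range b.probes(s) {
		b.bits[p/64] |= 1 << (p % 64)
	}
}

// MightContain never returns false for an added element (no false negatives),
// but can return true for an element never added (false positive).
func (b *bloom) MightContain(s string) bool {
	for _, p := range b.probes(s) {
		if b.bits[p/64]&(1<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	bf := newBloom(1<<16, 3)
	bf.Add("https://example.com/page1")
	fmt.Println(bf.MightContain("https://example.com/page1"))
	fmt.Println(bf.MightContain("https://example.com/other"))
}
```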
Problems:

- The amount of deduplication data a single node can hold is limited.
- Previous deduplication results cannot be preserved across restarts (they are kept in memory, in a map).

Solution:

- Perform distributed deduplication against a key-value store (such as Redis).
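Hiding the deduplication backend behind an interface lets the crawler start with an in-memory map and swap in a shared store later without touching crawl logic. The sketch below is illustrative: `Deduper` and its method are invented names, and the in-memory implementation stands in for a Redis-backed one (Redis `SADD` similarly reports whether the member was newly added, which gives the same seen/not-seen answer atomically across nodes).

```go
package main

import (
	"fmt"
	"sync"
)

// Deduper reports whether a URL was already crawled, recording it if not.
// A distributed deployment would implement this against a shared key-value
// store (e.g. Redis SADD); this in-memory version keeps the sketch
// self-contained and works for a single node.
type Deduper interface {
	Seen(url string) bool
}

type memoryDeduper struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewMemoryDeduper() *memoryDeduper {
	return &memoryDeduper{seen: make(map[string]struct{})}
}

// Seen returns true if url was recorded before; otherwise it records url
// and returns false. The mutex makes it safe for concurrent workers.
func (d *memoryDeduper) Seen(url string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, ok := d.seen[url]; ok {
		return true
	}
	d.seen[url] = struct{}{}
	return false
}

func main() {
	var d Deduper = NewMemoryDeduper()
	fmt.Println(d.Seen("https://example.com")) // first visit: false
	fmt.Println(d.Seen("https://example.com")) // duplicate: true
}
```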
4.2 Data Storage Problems
Problems:

- The architecture and technology stack of the storage layer differ greatly from those of the crawler.
- Further optimization requires specialized ElasticSearch expertise.

Solution:

- Split storage out into a dedicated storage service.
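One lightweight way to realize such a storage service in Go is to put the backend behind its own goroutine and a channel: crawler workers only send items; only the saver touches the storage stack. The sketch below is illustrative (`Item`, `startSaver`, and `saveOne` are invented names); `saveOne` would wrap the real backend call, e.g. an ElasticSearch index request, so swapping backends means swapping that one function.

```go
package main

import "fmt"

// Item is a parsed record produced by the crawler.
type Item struct {
	URL  string
	Data string
}

// startSaver launches the storage service on its own goroutine. It returns
// a send-only channel for items and a channel that reports, once the item
// channel is closed, how many items were saved successfully.
func startSaver(saveOne func(Item) error) (chan<- Item, <-chan int) {
	in := make(chan Item)
	done := make(chan int)
	go func() {
		saved := 0
		for it := range in {
			if err := saveOne(it); err != nil {
				fmt.Println("save error:", err) // real code would retry and log
				continue
			}
			saved++
		}
		done <- saved
	}()
	return in, done
}

func main() {
	var store []Item // stand-in for the real storage backend
	in, done := startSaver(func(it Item) error {
		store = append(store, it)
		return nil
	})
	in <- Item{URL: "https://example.com", Data: "hello"}
	in <- Item{URL: "https://example.com/2", Data: "world"}
	close(in)
	fmt.Println(<-done, "items saved")
}
```

Because only the saver goroutine calls `saveOne`, the storage backend needs no locking of its own; the channel serializes access.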
However sound the overall framework of a complex distributed crawler system, the concrete implementation still involves many details to handle, and this is where prior hands-on experience with crawler systems, and the pitfalls already stepped on, matters most.