Preface
For business reasons, our company needs to collect the historical articles of the WeChat public accounts that customers give us, and to update them every day. With more than 300 public accounts, checking them by hand every day is obviously not feasible, so the problem was handed to the IT team. As someone who loves crawlers, I naturally volunteered. I had written a Sogou WeChat crawler before, but since then I have been working on Java web development. This project rekindled my love for crawlers, and it was my first time building a crawler on a Spring Cloud architecture. I will share the project experience in a series of articles and provide the source code; corrections are welcome!
I. Introduction to the system
The system is developed in Java. After simple configuration of a public account's name or WeChat ID, it captures that account's articles on a schedule or in real time, including their read, like, and "Wow" (watching) counts.
II. System architecture
Technology stack
Spring Cloud, Spring Boot, MyBatis-Plus, Nacos, RocketMQ, Nginx
Storage
MySQL, MongoDB, Redis, Solr
Cache
Redis
Proxy
Fiddler
III. Advantages and disadvantages of the system
System advantages:
1. Once a public account is configured, its articles are captured automatically through Fiddler's JS injection feature and WebSocket.
2. The system has a distributed, highly available architecture.
3. RocketMQ message queues decouple the components and absorb collection failures caused by network jitter. If consumption fails three times, the failure is logged to MySQL so that no article is lost.
4. Any number of WeChat accounts can be added to improve collection efficiency and withstand anti-crawling limits.
5. Redis caches each WeChat account's collection records for the last 24 hours, to keep accounts from being banned.
6. Nacos serves as the configuration center, so the collection frequency can be adjusted in real time through hot configuration updates.
7. Collected data is stored in a Solr cluster to speed up retrieval.
8. The records returned from captured packets are archived in MongoDB, which makes error logs easy to inspect.
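Points 4 and 5 together amount to a rolling 24-hour quota per WeChat account. The following is only a minimal sketch of that idea, not code from WS-Spider: the class name, the method names, and the per-account limit are assumptions made for illustration, and an in-memory map stands in for Redis.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper (not from WS-Spider): tracks how many collections each
// WeChat account has made in the last 24 hours. The real system keeps these
// records in Redis; an in-memory map stands in for it here.
class AccountQuotaTracker {
    private static final Duration WINDOW = Duration.ofHours(24);
    private final int maxPerWindow; // assumed limit, to be tuned per account
    private final Map<String, Deque<Instant>> records = new HashMap<>();

    AccountQuotaTracker(int maxPerWindow) {
        this.maxPerWindow = maxPerWindow;
    }

    // Records one collection for the account if it is still under its quota.
    synchronized boolean tryAcquire(String account, Instant now) {
        Deque<Instant> history = records.computeIfAbsent(account, k -> new ArrayDeque<>());
        // Drop records that have fallen out of the 24-hour window.
        while (!history.isEmpty() && !history.peekFirst().isAfter(now.minus(WINDOW))) {
            history.pollFirst();
        }
        if (history.size() >= maxPerWindow) {
            return false;
        }
        history.addLast(now);
        return true;
    }

    // Picks the first account that still has quota, so that adding accounts
    // adds collection capacity.
    synchronized Optional<String> pickAccount(List<String> accounts, Instant now) {
        for (String account : accounts) {
            if (tryAcquire(account, now)) {
                return Optional.of(account);
            }
        }
        return Optional.empty();
    }
}
```

A scheduler could call `pickAccount` before each collection and skip the round when it returns an empty result, which keeps every account under its assumed daily limit.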
System disadvantages:
1. Information is collected with real devices and real accounts, so collecting a large number of public accounts requires several WeChat accounts as support. (If an account hits its limit for the day, the messages can be fetched by crawling the WeChat public platform interface instead.)
2. Collection times are set by the system, so messages arrive with some lag. (If there are not many public accounts and enough WeChat accounts are available, this can be improved by raising the collection frequency.)
IV. Introduction to modules
Some of the functionality was encapsulated ahead of time, because a management system and API calls are to be added later.
common-ws-starter
Common module: holds shared code such as utility classes and entity classes.
redis-ws-starter
Redis module: a wrapper around spring-boot-starter-data-redis that exposes the wrapped Redis and Redisson utilities.
rocketmq-ws-starter
RocketMQ module: a wrapper around rocketmq-spring-boot-starter that adds consumption retry and failure logging.
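The retry-then-log behaviour can be illustrated in plain Java. This is a sketch under stated assumptions, not the module's actual code and not the RocketMQ API: `RetryingConsumer` and `FailureLog` are hypothetical names, and the retries run in a simple loop, whereas the real module may well rely on the broker redelivering the message.

```java
import java.util.function.Consumer;

// Hypothetical callback for recording a message that could not be consumed,
// for example by writing a row to a MySQL failure table.
interface FailureLog<T> {
    void record(T message, int attempts, RuntimeException lastError);
}

// Hypothetical sketch: tries a handler up to maxAttempts times and hands the
// message to the failure log once every attempt has failed.
class RetryingConsumer<T> {
    private final int maxAttempts;
    private final Consumer<T> handler;
    private final FailureLog<T> failureLog;

    RetryingConsumer(int maxAttempts, Consumer<T> handler, FailureLog<T> failureLog) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.maxAttempts = maxAttempts;
        this.handler = handler;
        this.failureLog = failureLog;
    }

    // Returns true if the handler succeeded within the allowed attempts.
    boolean consume(T message) {
        RuntimeException lastError = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(message);
                return true;
            } catch (RuntimeException e) {
                lastError = e; // remember the failure and try again
            }
        }
        failureLog.record(message, maxAttempts, lastError);
        return false;
    }
}
```

With `maxAttempts` set to 3, a message whose handler keeps failing is recorded exactly once, after the third attempt, so a failed article can be found and collected again later.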
db-ws-starter
MySQL data source module: wraps the MySQL data sources, supports multiple data sources, and switches between them dynamically through a custom annotation.
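Annotation-driven switching usually comes down to a thread-local key that is set around the annotated call. In a Spring application this is commonly done with an AOP aspect and `AbstractRoutingDataSource`; the sketch below uses only a JDK dynamic proxy so that it stays self-contained. All names in it (`UseDataSource`, `DataSourceContext`, `DataSourceSwitcher`, `ArticleRepository`, and the "master" and "archive" keys) are hypothetical and not taken from db-ws-starter.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

// Hypothetical annotation naming the data source a method should run against.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface UseDataSource {
    String value();
}

// Holds the data source key for the current thread; a routing data source
// would read this key to choose the actual connection pool.
final class DataSourceContext {
    static final String DEFAULT = "master"; // assumed default key
    private static final ThreadLocal<String> KEY = ThreadLocal.withInitial(() -> DEFAULT);

    static String current() { return KEY.get(); }
    static void set(String key) { KEY.set(key); }
}

// Wraps a service so that @UseDataSource on an interface method switches the
// key for the duration of the call and restores the previous key afterwards.
final class DataSourceSwitcher {
    @SuppressWarnings("unchecked")
    static <T> T wrap(Class<T> api, T target) {
        return (T) Proxy.newProxyInstance(api.getClassLoader(), new Class<?>[] {api},
            (proxy, method, args) -> {
                UseDataSource annotation = method.getAnnotation(UseDataSource.class);
                String previous = DataSourceContext.current();
                if (annotation != null) {
                    DataSourceContext.set(annotation.value());
                }
                try {
                    return method.invoke(target, args);
                } catch (InvocationTargetException e) {
                    throw e.getCause(); // rethrow the service's own exception
                } finally {
                    DataSourceContext.set(previous);
                }
            });
    }
}

// Example service interface, used only for illustration.
interface ArticleRepository {
    @UseDataSource("archive")
    String findArchived();

    String findLatest();
}
```

Restoring the previous key in the `finally` block matters when threads are pooled: without it, a later unannotated call on the same thread would silently keep using the data source chosen by an earlier call.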
sql-wx-spider
MySQL database module: provides all MySQL database operations.
pc-wx-spider
PC collection module: contains the functions for collecting a public account's historical messages on the PC side.
java-wx-spider
Java extraction module: contains the functions for extracting article content in Java.
mobile-wx-spider
Emulator collection module: contains the functions for collecting messages through an emulator or a mobile phone.
V. Overall flow chart
VI. Screenshots of a run
PC and mobile
Console
End of a run
Conclusion
The project has been tested first-hand and is now running in production. Along the way it also solved the problem of converting Sogou WeChat temporary links into permanent links. I hope it helps anyone who is struggling with a similar business need. Working in Java these days is like rowing a boat against the current: if you do not move forward, you fall back, and I do not know when I will be out-competed myself. I wish everyone a "Sunflower Manual" (a legendary secret handbook) of their own. If you have read this far, how about a like and a follow?
The Java backend source code is available here: WS-Spider