Preface
For business reasons, our company needs to fetch the historical articles of the WeChat official accounts that customers give us, and to refresh them every day. Checking more than 300 official accounts by hand each day is obviously not feasible, so the problem was handed to the IT team. I had previously worked on a Sogou-based WeChat crawler and have been doing Java web development since then, so this project rekindled my love of crawlers. It was my first time building a crawler on a Spring Cloud architecture, and it took a little over 20 days to finish. In this series of articles I will share what I learned from the project and publish the source code so that you can point out its mistakes.
I. System overview
The system is developed in Java. After a simple configuration step, in which you enter an official account's name or WeChat ID, it crawls that account's articles either on a schedule or on demand, including each article's read count, like count, and "Wow" (在看) count.
II. System architecture
Technology stack
Spring Cloud, Spring Boot, MyBatis-Plus, Nacos, RocketMQ, Nginx
Storage
MySQL, MongoDB, Redis, Solr
Cache
Redis
Proxy
Fiddler
III. Advantages and disadvantages of the system
System advantages:
1. Once an official account is configured, it is captured automatically by means of Fiddler's JS injection and WebSocket.
2. The system uses a distributed architecture and is highly available.
3. RocketMQ decouples the processing stages, which handles collection failures caused by network stalls. If consumption fails three times, the failure is logged to MySQL so that no article is lost.
4. Any number of WeChat IDs can be added to raise collection throughput and to withstand anti-crawling limits.
5. Redis caches each WeChat ID's collection records for the past 24 hours, which helps prevent account bans.
6. Nacos serves as the configuration center, so the collection frequency can be adjusted in real time through hot configuration updates.
7. Collected data is stored in a Solr cluster to speed up retrieval.
8. The raw responses from packet capture are archived in MongoDB, which makes error logs easy to inspect.
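Points 4 and 5 above can be illustrated with a minimal sketch, assuming that each WeChat ID is allowed a fixed number of collections in any 24-hour window. The class name `AccountQuotaTracker`, the limit `maxPerWindow`, and the in-memory map are all hypothetical. The map merely stands in for the Redis records, and the actual project may track usage differently.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: picks the least-used WeChat ID inside a rolling 24-hour window. */
class AccountQuotaTracker {
    private static final Duration WINDOW = Duration.ofHours(24);

    // Assumed per-account limit; this is not a documented WeChat value.
    private final int maxPerWindow;
    // In-memory stand-in for the Redis records: WeChat ID -> timestamps of recent collections.
    private final Map<String, Deque<Instant>> records = new LinkedHashMap<>();

    AccountQuotaTracker(List<String> wechatIds, int maxPerWindow) {
        this.maxPerWindow = maxPerWindow;
        for (String id : wechatIds) {
            records.put(id, new ArrayDeque<>());
        }
    }

    /** Returns the account with the fewest recent collections, or empty if every account is exhausted. */
    synchronized Optional<String> acquire(Instant now) {
        Instant cutoff = now.minus(WINDOW);
        String best = null;
        int bestCount = Integer.MAX_VALUE;
        for (Map.Entry<String, Deque<Instant>> entry : records.entrySet()) {
            Deque<Instant> times = entry.getValue();
            // Drop records that have aged out of the 24-hour window.
            while (!times.isEmpty() && !times.peekFirst().isAfter(cutoff)) {
                times.pollFirst();
            }
            if (times.size() < maxPerWindow && times.size() < bestCount) {
                best = entry.getKey();
                bestCount = times.size();
            }
        }
        if (best == null) {
            return Optional.empty();
        }
        records.get(best).addLast(now); // record this collection against the chosen account
        return Optional.of(best);
    }
}
```

A scheduler would call `acquire(Instant.now())` before each collection and skip the round when the result is empty. Spreading the load over the least-used account is one possible policy; round-robin would work as well.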
System disadvantages:
1. Collection relies on real devices and real accounts. Collecting a large number of official accounts requires multiple WeChat IDs (if an account is rate-limited for the day, the data can be fetched by crawling the WeChat Official Accounts Platform interface instead).
2. Articles are not captured the moment an official account publishes them. Collection times are set by the system, so messages arrive with some lag (when there are few official accounts and enough WeChat IDs, the lag can be reduced by raising the collection frequency).
IV. Module overview
Because a management system and API call features will be added later, some functionality has been encapsulated in advance.
common-ws-starter
Common module: holds shared code such as utility classes and entity classes.
redis-ws-starter
Redis module: a wrapper around spring-boot-starter-data-redis that exposes the encapsulated Redis utility class and Redisson utility class.
rocketmq-ws-starter
RocketMQ module: a wrapper around rocketmq-spring-boot-starter that provides consumer retries and failure logging.
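The retry-then-log behaviour described for this module can be sketched in plain Java. This is not the API of rocketmq-spring-boot-starter or of the project's wrapper. `RetryingConsumer`, `FailureLog`, and the in-memory list that stands in for the MySQL log table are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch: retry a message handler, then record the failure instead of losing it. */
class RetryingConsumer<T> {

    /** One failure record; in a real system this could be a row in a MySQL log table. */
    static final class FailureLog<M> {
        final M message;
        final int attempts;
        final String error;

        FailureLog(M message, int attempts, String error) {
            this.message = message;
            this.attempts = attempts;
            this.error = error;
        }
    }

    private final Consumer<T> handler;
    private final int maxAttempts;
    private final List<FailureLog<T>> failureLog = new ArrayList<>(); // stand-in for MySQL

    RetryingConsumer(Consumer<T> handler, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.handler = handler;
        this.maxAttempts = maxAttempts;
    }

    /** Returns true if the message was handled, false if it ended up in the failure log. */
    boolean consume(T message) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(message);
                return true;
            } catch (RuntimeException e) {
                last = e; // for example, a network timeout while fetching the article
            }
        }
        // Every attempt failed: keep enough detail to replay the message later.
        failureLog.add(new FailureLog<>(message, maxAttempts, String.valueOf(last.getMessage())));
        return false;
    }

    List<FailureLog<T>> failures() {
        return Collections.unmodifiableList(new ArrayList<>(failureLog));
    }
}
```

With `maxAttempts` set to 3 this mirrors the "three failures, then log to MySQL" rule described earlier. A production version would probably also wait between attempts, which this sketch omits.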
db-ws-starter
MySQL data source module: wraps the MySQL data sources, supports multiple data sources, and switches between them dynamically by means of custom annotations.
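Annotation-driven data source switching is commonly built from three parts: an annotation, a thread-local key, and a router that resolves the key. The sketch below shows that pattern in plain Java without Spring. All names (`UseDataSource`, `DataSourceContext`, `DataSourceRouter`, `DataSourceSwitcher`, `ArticleDao`, and the "master"/"replica" keys) are hypothetical, and JDBC URL strings stand in for real `DataSource` objects. In a Spring application the router role is often filled by `AbstractRoutingDataSource` and the switcher by an AOP aspect, but the source does not say whether this module is built that way.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;

/** Hypothetical annotation naming the data source a method should use. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface UseDataSource {
    String value();
}

/** Holds the data source key for the current thread. */
final class DataSourceContext {
    static final String DEFAULT_KEY = "master";
    private static final ThreadLocal<String> KEY = ThreadLocal.withInitial(() -> DEFAULT_KEY);

    private DataSourceContext() {}

    static String current() { return KEY.get(); }
    static void set(String key) { KEY.set(key); }
}

/** Resolves the current key to a target; JDBC URLs stand in for real DataSource objects. */
final class DataSourceRouter {
    private final Map<String, String> targets;

    DataSourceRouter(Map<String, String> targets) { this.targets = targets; }

    String currentTarget() {
        String key = DataSourceContext.current();
        String target = targets.get(key);
        if (target == null) {
            throw new IllegalStateException("Unknown data source: " + key);
        }
        return target;
    }
}

/** Plays the role an AOP aspect would play: set the key, invoke, always restore. */
final class DataSourceSwitcher {
    private DataSourceSwitcher() {}

    static Object invoke(Object bean, String methodName) {
        String previous = DataSourceContext.current();
        try {
            Method method = bean.getClass().getDeclaredMethod(methodName);
            method.setAccessible(true);
            UseDataSource annotation = method.getAnnotation(UseDataSource.class);
            if (annotation != null) {
                DataSourceContext.set(annotation.value());
            }
            return method.invoke(bean);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot invoke " + methodName, e);
        } finally {
            DataSourceContext.set(previous); // restore so nested calls do not leak their key
        }
    }
}

/** Example caller: the annotated method is routed to the hypothetical "replica" source. */
class ArticleDao {
    private final DataSourceRouter router;

    ArticleDao(DataSourceRouter router) { this.router = router; }

    @UseDataSource("replica")
    String readTarget() { return router.currentTarget(); }

    String writeTarget() { return router.currentTarget(); }
}
```

Restoring the previous key in `finally`, rather than simply clearing it, keeps the switch correct when one annotated method calls another.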
sql-wx-spider
MySQL database module: provides all MySQL database operations.
pc-wx-spider
PC collection module: contains the functions for collecting an official account's historical messages on the PC client.
java-wx-spider
Java extraction module: contains the functions with which the Java program extracts article content.
mobile-wx-spider
Emulator collection module: contains the functions for collecting a message's interaction counts through an emulator or a mobile phone.
V. Overall flow chart
VI. Screenshots of a run
PC and mobile
Console
End of the run
Conclusion
I have tested the project myself and it is now running in production. The problem of converting Sogou temporary links into permanent links was also solved during development. I hope this helps fellow developers who are stuck on similar work. Doing Java these days is like rowing upstream: if you don't move forward, you fall back, and I can't say when the competition became this fierce. I hope each of you finds your own "Sunflower Manual". If you have read this far, how about a like? The Java backend source code is attached here: WS-Spider